How to convert MS DOS string to UNICODE String.

  • Question

  • Hi,

    I am trying to convert an MS-DOS string (a wide-character string read from a pipe opened with the _wpopen() function) to Unicode.

    If the MS-DOS string contains only English characters it works fine, but if it contains non-English characters (German, in my case), I run into a problem.

    I get the string Gerätetyp from the pipe, but when I compare it with the constant "Gerätetyp" the comparison fails.

    Below is the sample code.


    TCHAR szOutput[1024] = {0};
    TCHAR szTemp[MAX_PATH] = {0};
    FILE *fp = _wpopen(L"netsh mbn show int", L"rt");
    if(fp != NULL)
    {
        while(fgetws(szTemp, MAX_PATH, fp))
        {
            _tcscat(szOutput, szTemp);
        }
        // szOutput contains the string Gerätetyp; I can see it while debugging.
        // According to Wikipedia (http://en.wikipedia.org/wiki/%C3%84):
        // the Unicode code point for ä is U+00E4 and for Ä is U+00C4.
        // The Windows-1252 code for ä is 228 and for Ä 196. In MS-DOS, ä is 132 and Ä is 142.
        if(_tcsstr(szOutput, L"Gerätetyp")) // while debugging I see that the ä in szOutput has the value 132
        {
            printf("found");
        }
        else
        {
            printf("not found");
        }
        _pclose(fp);
    }


    Please help on this.


    Windows Desktop, Windows Phone developer.

    Tuesday, October 16, 2012 1:57 PM

Answers

  • On 17/10/2012 11:03, Venkatachalapathi G wrote:


    In my case, the output of the netsh command when I run it in the command prompt looks like this:

         Gerätetyp        : In das System ist ein mobiles Breitbandgerät eingebettet.

    [...]


    but when I wrote the same data to a text file, the file contained this:

         Ger„tetyp        : In das System ist ein mobiles Breitbandger„t eingebettet.

    It seems that the ä:
       Unicode Character 'LATIN SMALL LETTER A WITH DIAERESIS' (U+00E4)
       http://www.fileformat.info/info/unicode/char/E4/index.htm

    is wrongly mapped to „:
       Unicode Character 'DOUBLE LOW-9 QUOTATION MARK' (U+201E)
       http://www.fileformat.info/info/unicode/char/201e/index.htm

    There seems to be a code page mismatch.

    Note that "a with diaeresis" (U+00E4) is represented by the byte 0x84 in code page 850 (MS-DOS Latin-1).
    In Windows code page 1252 (Windows Latin-1), however, 0x84 is associated with "double low-9 quotation mark" (U+201E).

    I suggest you pay attention to how you convert the text.
    If the input is in code page 850, use MultiByteToWideChar (or CA2W) with code page identifier 850 to get the proper Unicode string.
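
    To make the mismatch concrete, here is a small portable sketch that decodes the same bytes under both code pages. It is only an illustration: the function names DecodeCp850 and DecodeCp1252 are made up for this example, and the tables are deliberately partial (just the German letters and the one byte discussed here), so it is not a substitute for MultiByteToWideChar.

```cpp
#include <string>

// Partial code page 850 decoder: ASCII plus the German letters only.
// Bytes that are not listed decode to U+FFFD (replacement character).
std::u16string DecodeCp850(const std::string& bytes)
{
    std::u16string out;
    for (unsigned char b : bytes) {
        switch (b) {
            case 0x81: out += u'\u00FC'; break; // ü
            case 0x84: out += u'\u00E4'; break; // ä
            case 0x8E: out += u'\u00C4'; break; // Ä
            case 0x94: out += u'\u00F6'; break; // ö
            case 0x99: out += u'\u00D6'; break; // Ö
            case 0x9A: out += u'\u00DC'; break; // Ü
            case 0xE1: out += u'\u00DF'; break; // ß
            default:
                out += (b < 0x80) ? static_cast<char16_t>(b) : u'\uFFFD';
        }
    }
    return out;
}

// Partial Windows-1252 decoder: ASCII and 0xA0..0xFF map to the same
// code point value; of the 0x80..0x9F block only 0x84 is listed here.
std::u16string DecodeCp1252(const std::string& bytes)
{
    std::u16string out;
    for (unsigned char b : bytes) {
        if (b == 0x84)
            out += u'\u201E';                 // „ double low-9 quotation mark
        else if (b < 0x80 || b >= 0xA0)
            out += static_cast<char16_t>(b);  // same value as the code point
        else
            out += u'\uFFFD';                 // not listed in this sketch
    }
    return out;
}
```

    With these helpers, the bytes "Ger\x84tetyp" decode to Gerätetyp under code page 850 but to Ger„tetyp under code page 1252, which matches what ended up in the text file.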

    Giovanni

    Wednesday, October 17, 2012 10:00 AM

All replies

  • If the string that comes from the pipe uses Windows-1252 encoding (or any other ANSI/multibyte encoding), then you can't read it with fgetws and store it in a wchar_t array. Use a char array and fgets instead.

    Once you get the multibyte string from the pipe you can use the mbstowcs or MultiByteToWideChar function to convert it to a Unicode string.
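
    Reading the raw bytes first might look like the sketch below. It is an assumption-laden illustration rather than code from this thread: the name ReadCommandOutput and the POPEN/PCLOSE macros are hypothetical, and a std::string is used instead of a fixed char array so that long output cannot overflow a buffer. The returned bytes are not decoded in any way; they still have to be converted with the correct code page.

```cpp
#include <cstdio>
#include <string>

// _popen/_pclose are the Microsoft CRT names; POSIX calls them popen/pclose.
#ifdef _WIN32
#define POPEN  _popen
#define PCLOSE _pclose
#else
#define POPEN  popen
#define PCLOSE pclose
#endif

// Runs a command and returns everything it wrote to stdout as raw bytes.
// Returns an empty string if the pipe could not be opened.
std::string ReadCommandOutput(const char* command)
{
    std::string bytes;
    FILE* pipe = POPEN(command, "r");
    if (pipe == NULL)
        return bytes;

    char chunk[1024];
    // fgets reads narrow chars, so no wide-character conversion is applied.
    while (fgets(chunk, static_cast<int>(sizeof chunk), pipe) != NULL)
        bytes += chunk;

    PCLOSE(pipe);
    return bytes;
}
```

    For example, ReadCommandOutput("netsh mbn show int") would return the bytes exactly as netsh emitted them.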

    Tuesday, October 16, 2012 2:13 PM
  • I tried that too. I used _popen() rather than _wpopen() and read the output into a char array.

    I then used MultiByteToWideChar() to get the Unicode version of the string, but the comparison still fails.


    Windows Desktop, Windows Phone developer.

    Tuesday, October 16, 2012 2:18 PM
  • Mike Danes wrote:

    Once you get the multibyte string from the pipe you can use the mbstowcs or MultiByteToWideChar function to convert it to a Unicode string.

    ... using the CP_OEMCP code page, in this case.


    Igor Tandetnik

    Tuesday, October 16, 2012 2:21 PM
  • I tried the CP_OEMCP code page as well, and several other code pages too, but it still fails.

    Windows Desktop, Windows Phone developer.

    Tuesday, October 16, 2012 2:49 PM
  • Changing _wpopen to _popen doesn't do anything in your case:

    "_wpopen is a wide-character version of _popen; the path argument to _wpopen is a wide-character string. _wpopen and _popen behave identically otherwise." - http://msdn.microsoft.com/en-us/library/96ayss4b.aspx

    Your problem is in reading from the pipe, and that's fgets/fgetws.

    It may also be worth asking why you are trying to parse the output of netsh. Isn't there some Windows API that can provide the same information? Perhaps this: http://msdn.microsoft.com/en-us/library/windows/desktop/dd323271(v=vs.85).aspx

    Tuesday, October 16, 2012 2:55 PM
  • On 10/16/2012 10:49 AM, Venkatachalapathi G wrote:

    I tried the CP_OEMCP code page as well, and several other code pages too, but it still fails.

    Show your code, describe the results you observe, and how they differ from your expectations.


    Igor Tandetnik

    Tuesday, October 16, 2012 3:49 PM
  • I want to know the form factor of a mobile broadband device, and there is no API that provides this information. The "netsh mbn show int" command prints the device type, which I want to parse to get the form factor.

    Windows Desktop, Windows Phone developer.

    Tuesday, October 16, 2012 6:24 PM
  • Here is my code.
    // testsample.cpp : Defines the entry point for the console application.
    //

    #include "stdafx.h"
    #include "windows.h"
    #include "stdio.h"
    #include "string.h"
    #include "conio.h"
    #include "locale.h"

    int _tmain(int argc, _TCHAR* argv[])
    {
        char szOutOfNetsh[1024] = {0};
        char szTemp[1024*8]  = {0};
        TCHAR szTempWideString[1024*8] = {0};
        //setlocale( LC_CTYPE, "English_United States.1252" );
        FILE *fp = _popen("netsh mbn show int", "rt");
        if(fp != NULL)
        {
            while(fgets(szOutOfNetsh,1024, fp))
            {
                //printf(szOutOfNetsh);
                strcat(szTemp,szOutOfNetsh);
            }
            // utf8 is the pointer to your UTF-8 string
            char* utf8 = (char*)szTemp;
            // convert multibyte UTF-8 to wide string UTF-16
            int length = MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)utf8, -1, NULL, 0);
            if (length > 0)
            {
                wchar_t* wide = new wchar_t[length];
                MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)utf8, -1, wide, length);
                _tprintf(wide);
                if(_tcsstr(wide,L"Gerätetyp"))
                {
                    printf("\n********found***********\n");
                }
                delete wide;
            }
            fclose(fp);
        }
        return 0;
    }

    Windows Desktop, Windows Phone developer.

    Wednesday, October 17, 2012 6:18 AM
  • On 17/10/2012 08:18, Venkatachalapathi G wrote:

    Here is my code.

    You missed the other two points Igor mentioned:
      - describe the results you observe
      - and how they differ from your expectations.

         // convert multibyte UTF-8 to wide string UTF-16
         int length = MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)utf8, -1, NULL, 0);

    Note that you don't need the (LPCSTR) cast.
    LPCSTR just means 'const char*', so the 'utf8' char* variable converts to it implicitly.

    Maybe your problem is the code page? Perhaps you should use something different, like 1252 instead of CP_UTF8.

    Moreover, you may want to handle the case of length == 0 with a call to GetLastError().

         if (length > 0)
         {
             wchar_t* wide = new wchar_t[length];
             MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)utf8, -1, wide, length);
             _tprintf(wide);
             if(_tcsstr(wide,L"Gerätetyp"))
             {
                 printf("\n********found***********\n");
             }
             delete wide;

    The above should be 'delete[] wide;'.

    Note that in modern C++ it would be better to use std::wstring or std::vector<wchar_t>, instead of raw new[]/delete[].

    Note also that if you can use ATL, there are convenient helper classes for the conversion, like CA2W, which wraps MultiByteToWideChar() calls.
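
    Put together, these suggestions (handle the length == 0 case, let std::wstring own the buffer) could look roughly like the sketch below. It is not code from this thread: the helper name ToWide and its default of CP_OEMCP are assumptions made for illustration, and the whole block is guarded with _WIN32 because MultiByteToWideChar is a Windows API.

```cpp
#include <stdexcept>
#include <string>

#ifdef _WIN32
#include <windows.h>

// Converts bytes in the given code page (by default the system OEM code page)
// to a UTF-16 std::wstring. Throws std::runtime_error if the conversion fails.
std::wstring ToWide(const std::string& bytes, UINT codePage = CP_OEMCP)
{
    if (bytes.empty())
        return std::wstring();

    const int srcLen = static_cast<int>(bytes.size());

    // First call: ask how many wchar_t units the result needs.
    const int length = MultiByteToWideChar(codePage, 0, bytes.data(), srcLen, NULL, 0);
    if (length == 0)
        throw std::runtime_error("MultiByteToWideChar failed, error "
                                 + std::to_string(GetLastError()));

    // Second call: convert into a buffer owned by std::wstring,
    // so there is no new[]/delete[] pair to get wrong.
    std::wstring wide(length, L'\0');
    if (MultiByteToWideChar(codePage, 0, bytes.data(), srcLen, &wide[0], length) == 0)
        throw std::runtime_error("MultiByteToWideChar failed, error "
                                 + std::to_string(GetLastError()));

    return wide;
}
#endif
```

    With a helper like this, the check in the sample could be written as ToWide(szTemp).find(L"Gerätetyp") != std::wstring::npos, with no manual cleanup.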

    Giovanni

    Wednesday, October 17, 2012 7:53 AM
  • Thanks for your suggestions. In my original code I am not using raw pointers; I use STL classes directly. I wrote the sample code this way just to make it easy to understand.

    In my case, the output of the netsh command when I run it in the command prompt looks like this:

    Auf dem System ist 1 Schnittstelle vorhanden:

        Name               : Mobile Breitbandverbindung
        Beschreibung        : Sierra Wireless MC8355 - Gobi 3000 (TM) HS-USB Mobile Broadband Device 9013
        GUID               : {8C5DD47B-CA04-4249-A99E-FF7EAA56D891}
        Physische Adresse   : 00:a0:c6:00:00:00
        Status              : Nicht verbunden
        Gerätetyp        : In das System ist ein mobiles Breitbandgerät eingebettet.
        Mobilklasse     : CDMA
        Geräte-ID          : A1000004B85EC5
        Hersteller       : Sierra Wireless Inc
        Modell              : Sierra Wireless MC8355 - Gobi 3
        Firmwareversion   : D3600-STSUVH-1576 0d010008
        Anbietername      :
        Roaming            : Ja
        Signal             : 64

    but when I wrote the same data to a text file, the file contained this:

    Auf dem System ist 1 Schnittstelle vorhanden:

        Name               : Mobile Breitbandverbindung
        Beschreibung        : Sierra Wireless MC8355 - Gobi 3000 (TM) HS-USB Mobile Broadband Device 9013
        GUID               : {8C5DD47B-CA04-4249-A99E-FF7EAA56D891}
        Physische Adresse   : 00:a0:c6:00:00:00
        Status              : Nicht verbunden
        Ger„tetyp        : In das System ist ein mobiles Breitbandger„t eingebettet.
        Mobilklasse     : CDMA
        Ger„te-ID          : A1000004B85EC5
        Hersteller       : Sierra Wireless Inc
        Modell              : Sierra Wireless MC8355 - Gobi 3
        Firmwareversion   : D3600-STSUVH-1576 0d010008
        Anbietername      :
        Roaming            : Ja
        Signal             : 64


    Windows Desktop, Windows Phone developer.

    Wednesday, October 17, 2012 9:03 AM
  • For me this code works when I use CP_OEMCP. I'm not sure how relevant that is, because I don't have a mobile device and my system uses English, so I had to use some tricks to get netsh to print something like Gerätetyp.

    One thing you could try is to call SetConsoleOutputCP(CP_UTF8) before _popen and then use CP_UTF8 for MultiByteToWideChar as you do now.

    Wednesday, October 17, 2012 10:59 AM
  • Thanks a lot, Giovanni. With code page 850 it started working. Now I will check on a French OS and let you guys know the results.


    Windows Desktop, Windows Phone developer.

    Wednesday, October 17, 2012 11:20 AM
  • Venkatachalapathi G wrote:

       // convert multibyte UTF-8 to wide string UTF-16
       int length = MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)utf8, -1,  NULL, 0);

    Which part of "using CP_OEMCP" did you find unclear?


    Igor Tandetnik

    Wednesday, October 17, 2012 1:17 PM
  • It worked on the French OS as well. Thanks, guys, for your help.


    Windows Desktop, Windows Phone developer.

    Wednesday, October 17, 2012 1:23 PM