Loss-free conversion from Unicode(UTF-16) to multibyte(MBCS) and back for multiple different languages?

  • Question

  • Hello everybody.

    I have read a lot about individual conversions from Unicode (UTF-16, with 2-byte code units) to multibyte representations and back using an appropriate locale, but this is not what I am looking for. (By the way, the following is for unmanaged C++.)

    I am maintaining a server application which may get input from anywhere in different languages, e.g. English, Korean, Chinese, Arabic, ..., as std::wstring. Some of these inputs are passed to a common library, with sources shared between Unix and Windows, which converts the wstring contents to MBCS via wcstombs. But this does not work correctly: converting the MBCS string back to Unicode (via mbstowcs) does not reproduce the original input. Of course, I cannot set a specific locale for the whole server application.

    I thought that a conversion from Unicode to a multibyte representation is always defined, and that only the character visualization (and the keyboard layout) is determined by a code page. Is there a standard way to perform this loss-free conversion for multiple languages, or is it impossible?

    Note that my server application uses a multibyte character set as its project setting for historical reasons. But I have already tested with code in a small Unicode project and tried to use WideCharToMultiByte with CP_UTF8 as the first parameter, without success.

    Regards
     Frank

    Monday, June 23, 2008 11:18 AM

All replies

  • Frank:

    You should be able to use WideCharToMultiByte() and MultiByteToWideChar() with CP_UTF8 to convert any valid UTF-16 string to UTF-8 and back again. If you were not able to do this you must have done something wrong.

    You can also use the simpler class-based CA2W and CW2A in atlconv.h (at least with VS2008 or VS2005 SP1; in earlier versions there was a bug in CW2A that did not allocate a large enough buffer).

    Use Unicode settings and forget about the local code page.


    David Wilkinson | Visual C++ MVP
    • Marked as answer by Yan-Fei Wei Wednesday, June 25, 2008 8:04 AM
    Monday, June 23, 2008 1:16 PM
  • Unicode to MBCS is a lossy conversion, especially when you need to convert strings from different cultures using the same code page.  You can convert to UTF8 as Dave pointed out but you'll have to treat the result as binary data.  You can't expect to be able to query on the generated strings.  Storing the strings in Unicode columns is the only real fix.
    Hans Passant.
    Monday, June 23, 2008 1:48 PM
  • Hi Hans (nobugz).

    If you say I have to treat the result as binary data, do you mean something like std::vector<unsigned char> to hold it? Until now, the method I use converts from std::wstring to std::string (see below).

    As I am not able to use WideCharToMultiByte() or MultiByteToWideChar() directly (as Dave suggested) in the sources shared between Windows and Unix, how could I modify the following code to work correctly with inputs such as ﻶﻹﻻﻄﺶﯓ, ☼? Moreover, do I have to change the process so that it no longer uses the "C" locale?

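    #include <cstdlib>   // wcstombs
    #include <string>

    // Converts via the current C locale's multibyte encoding (locale dependent).
    std::string wstring2string (const wchar_t * wstrToConvert)
    {
       // With a NULL destination, wcstombs reports the required byte count,
       // or (size_t)-1 if a character cannot be converted in this locale.
       size_t len = ::wcstombs (NULL, wstrToConvert, 0);
       if (len == static_cast<size_t>(-1))
          return std::string();   // conversion failed

       std::string strConverted (len, '?');
       if (len > 0)
          ::wcstombs (&strConverted[0], wstrToConvert, len);

       return strConverted;
    }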

    Regards
     Frank

    Monday, June 23, 2008 2:34 PM
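
    For sources shared between Windows and Unix, one locale-independent option is to encode to UTF-8 by hand instead of calling wcstombs. The following is only a minimal sketch of that idea, not the method used in the thread: the names append_utf8 and wstring_to_utf8 are hypothetical, and it assumes wchar_t holds either UTF-16 code units (as on Windows) or whole code points (as on typical Unix systems).

```cpp
#include <string>

// Hypothetical helper: appends one Unicode code point to 'out' as UTF-8.
static void append_utf8(std::string& out, unsigned long cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);                          // 1 byte (ASCII)
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 2 bytes
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));           // 3 bytes
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 4 bytes
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical replacement for a wcstombs-based converter. It never consults
// the locale. Assumption: the input is valid UTF-16 or UTF-32; an unpaired
// surrogate is not rejected, it is simply encoded as it stands.
std::string wstring_to_utf8(const std::wstring& in)
{
    std::string out;
    for (std::wstring::size_type i = 0; i < in.size(); ++i) {
        unsigned long cp = static_cast<unsigned long>(in[i]);
        // Combine a UTF-16 surrogate pair into a single code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            const unsigned long lo = static_cast<unsigned long>(in[i + 1]);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            }
        }
        append_utf8(out, cp);
    }
    return out;
}
```

    The result is an ordinary zero-terminated-friendly std::string in which plain ASCII text is unchanged and everything else is encoded as multi-byte sequences.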
  • Hi Dave.

    I will try it again with my unicode test project but I am unsure what I might have done wrong. Could you give me some hints - for example common pitfalls? See my post to nobugz for a code review.

    Regards
     Frank
    Monday, June 23, 2008 2:37 PM
  • Frank:

    I have never used wctombs(), nor do I find it in the VC8 or VC9 Help. But I believe this family uses the local code page, which is a non-starter if you want to handle any Unicode string.

    If you are having trouble with WideCharToMultiByte() and CP_UTF8, then post your code.



    David Wilkinson | Visual C++ MVP
    Monday, June 23, 2008 2:53 PM
  • Hi Dave.

    In a unicode unmanaged C++ test project (VC 2003.NET) I used the following code:

    const wchar_t wsz[] = L" \xe801\xe802\xe803\xe804\xe805\xe818\xe83A ";// some arabic letter without any meaning, just for test
    const int len = sizeof(wsz)/sizeof(wchar_t) + 1;
    std::wstring wstr = wsz;
    char * ch = new char [2*len+1];
    BOOL bUsedDef = FALSE;
    if (0 != WideCharToMultiByte(CP_UTF8, 0, wsz, len, ch, 2*len, NULL, &bUsedDef) && !bUsedDef)
    {
       std::string str = ch;
       wchar_t * wch = new wchar_t [len+1];
       if (0 != MultiByteToWideChar(CP_UTF8, 0, ch, 2*len, wch, len))
       {
          assert(wstr == wch);
       }
       delete [] wch;
    }
    delete [] ch;


    Note that the OS has no regional support installed for this language. The code fails at the first if statement.

    Regards
     Frank

    Monday, June 23, 2008 3:23 PM
  • Your buffer is no doubt too small, UTF8 can need up to 4 bytes per character.  Use GetLastError() to be sure.  Getting back to my earlier comment about treating the string as binary data: you can still store it in a char*, it's still going to be a zero-terminated string.  But only code points 1-127 will survive as-is, the rest get encoded.  Running a SQL query on those encoded strings can't work, unless SQL has a UTF8 encoded string data type.  I doubt it.
    Hans Passant.
    • Marked as answer by Yan-Fei Wei Wednesday, June 25, 2008 8:04 AM
    Monday, June 23, 2008 3:54 PM
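
    The sizing rule above can be made concrete with a small sketch. The function names utf8_length and utf8_worst_case are hypothetical, and the code only illustrates the standard UTF-8 length ranges; it is not taken from any library. For the nine-character test string posted earlier (two spaces and seven characters in the range U+E801 to U+E83A), it gives 2 * 1 + 7 * 3 = 23 bytes plus a terminator.

```cpp
#include <cstddef>

// Number of UTF-8 bytes needed to encode one Unicode code point.
std::size_t utf8_length(unsigned long cp)
{
    if (cp < 0x80)    return 1;   // ASCII survives unchanged
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;   // rest of the Basic Multilingual Plane
    return 4;                     // U+10000 and above
}

// Conservative buffer size in bytes for 'wide_chars' input characters plus a
// terminating zero, using the "4 bytes per character" rule from the thread.
std::size_t utf8_worst_case(std::size_t wide_chars)
{
    return 4 * wide_chars + 1;
}
```

    A buffer of twice the input length, as in the posted test code, is therefore too small for this input.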
  • Hans,

    Well, according to the documentation, the following line should give the required buffer size for the conversion, but stepping through the CRT code you will find that it only returns the number of wchar_t in the source wstring:

     size_t len = ::wcstombs (NULL, wstrToConvert, 0);

    But even when increasing the buffer size to 100 times the source size, the conversion fails and does not even touch the first byte. This occurs whether I use WideCharToMultiByte with CP_UTF8 directly or wcstombs (which redirects to WideCharToMultiByte with the local code page under Windows).

    Concerning SQL and encoding: it is no problem for me that the code points beyond 1-127 get encoded, as I am not searching on them. But I must ensure that converting back to Unicode in a 2-byte representation yields the original source again.


    Extract from the MSDN documentation for wcstombs:

    size_t wcstombs(   char *mbstr,   const wchar_t *wcstr,   size_t count );

    ... If the mbstr argument is NULL, wcstombs returns the required size of the destination string. If wcstombs encounters a wide character it cannot convert to a multibyte character, it returns –1 cast to type size_t. ... If there are two bytes in the multibyte output string for every wide character in the input string (including the wide character NULL), the result is guaranteed to fit. ....

    The only question for me is when and why the conversion can fail (and whether it has something to do with code pages) and how to get around it.

    Regards
     Frank
    Tuesday, June 24, 2008 7:34 AM
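
    The requirement that converting back yields the original string holds for UTF-8 because every code point has exactly one encoding. As an illustration, here is a minimal locale-independent decoder sketch. The name utf8_to_wstring is hypothetical, and the code assumes well-formed UTF-8 input: it does no validation of continuation bytes, overlong forms or truncated sequences, which production code would need.

```cpp
#include <string>

// Hypothetical helper: decodes well-formed UTF-8 into a std::wstring.
// Produces surrogate pairs when wchar_t is 16 bits wide (as on Windows)
// and one wchar_t per code point otherwise.
std::wstring utf8_to_wstring(const std::string& in)
{
    std::wstring out;
    std::string::size_type i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        unsigned long cp;
        int extra;   // number of continuation bytes that follow the lead byte
        if (lead < 0x80)      { cp = lead;        extra = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }
        ++i;
        // Each continuation byte contributes its low 6 bits.
        for (; extra > 0 && i < in.size(); --extra, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);

        if (sizeof(wchar_t) == 2 && cp >= 0x10000) {
            cp -= 0x10000;   // split into a UTF-16 surrogate pair
            out += static_cast<wchar_t>(0xD800 + (cp >> 10));
            out += static_cast<wchar_t>(0xDC00 + (cp & 0x3FF));
        } else {
            out += static_cast<wchar_t>(cp);
        }
    }
    return out;
}
```

    Unlike mbstowcs, this does not depend on the process locale, so the same bytes decode to the same wide string on every platform.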
  • Hi Dave.

    Did my posted code using WideCharToMultiByte directly work for you, or is there a bug? If not, do you have any recommendations for getting it to work (really appreciated)?

    Regards
     Frank
    Tuesday, June 24, 2008 7:52 AM
  • Frank:

    You are not using WideCharToMultiByte() correctly. For CP_UTF8, the last parameter must be NULL.

    Also, as Hans noted, you need 4 times the length to be safe with converting to UTF8.


    David Wilkinson | Visual C++ MVP
    • Proposed as answer by Frank2068 Tuesday, June 24, 2008 3:17 PM
    • Marked as answer by Yan-Fei Wei Wednesday, June 25, 2008 8:04 AM
    Tuesday, June 24, 2008 12:04 PM
  • Hi Dave.

    Thanks for noting this. At least WideCharToMultiByte starts doing something now, but to me this is really weird. I just used the code from the CRT's wcstombs.c, which always provides the last parameter. So using wcstombs will never yield anything fruitful, even if I managed to set the local code page to CP_UTF8. But the latter does not seem to be the right way either, because of some changes in VC2005/VC2008 that reject setting CP_UTF7/8 as the local code page.

    Well, I still cannot get a real match when converting L" \xe801\xe802\xe803\xe804\xe805\xe818\xe83A " back and forth. I have to check how the Unix implementations work and whether there is some more sophisticated way to perform this task there. But it is at least a start ...

    Thanks again.
    Tuesday, June 24, 2008 3:15 PM
  • Could you check if the following simplified test works in your environment?

     

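    // Needs <windows.h> and <cassert>; place inside a function.
    const wchar_t wsz1[] = L" \xe801\xe802\xe803\xe804\xe805\xe818\xe83A ";
    char sz[100];
    // -1 lets the API determine the length and include the terminating zero.
    if (WideCharToMultiByte(CP_UTF8, 0, wsz1, -1, sz, 100, NULL, NULL) != 0)
    {
        wchar_t wsz2[100];
        if (MultiByteToWideChar(CP_UTF8, 0, sz, -1, wsz2, 100) != 0)
        {
            // The round trip must reproduce the original string.
            assert(lstrcmpW(wsz1, wsz2) == 0);
        }
        else
        {
            assert(false);
        }
    }
    else
    {
        assert(false);
    }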

    • Edited by Viorel_MVP Tuesday, June 24, 2008 3:43 PM
    • Marked as answer by Frank2068 Wednesday, June 25, 2008 12:23 PM
    Tuesday, June 24, 2008 3:43 PM
  • Frank:

    UTF-7 and UTF-8 are not valid Windows code pages. That is why you have to use WideCharToMultiByte(), which can specify these encodings.

    If you are not round-tripping to UTF-8 and back correctly, I think it is because (for a reason I do not quite understand) your last parameter to MultiByteToWideChar() is incorrect. Make it len+1.

    BTW, when you indicate your question is answered, you should use "mark as answer" not "propose as answer". For me, "propose as answer" makes no sense for the original poster and I think it should be disabled. (actually I would get rid of it altogether ...)


    David Wilkinson | Visual C++ MVP
    • Marked as answer by Yan-Fei Wei Wednesday, June 25, 2008 8:04 AM
    Tuesday, June 24, 2008 3:57 PM
  • Hi Viorel.

    Thanks for your posting. Indeed, this works without constraints. Things are so easy if you just let the Microsoft API take care of the string length calculation (-1). But of course the constant-size buffer is just for testing here, and I have to make it variable for my program. I guess I have to call WideCharToMultiByte(CP_UTF8, 0, wsz, -1, NULL, 0, NULL, NULL); for that.

    Thanks again.
    • Edited by Frank2068 Wednesday, June 25, 2008 12:21 PM correct typo
    Wednesday, June 25, 2008 12:19 PM
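
    The sizing call mentioned in the last post leads to the usual two-call pattern: ask for the required size first, then convert into a buffer of exactly that size. The following is a sketch under stated assumptions rather than a definitive implementation: the helper name wide_to_utf8 is hypothetical, the code is Windows-only, and it is wrapped in #ifdef _WIN32 so that shared sources still compile on other platforms. It passes an explicit length instead of -1, so the returned count does not include a terminating zero.

```cpp
#include <string>

#ifdef _WIN32
#include <windows.h>

// Hypothetical helper: UTF-16 std::wstring to UTF-8 std::string.
std::string wide_to_utf8(const std::wstring& w)
{
    if (w.empty())
        return std::string();

    // First call: a NULL buffer with size 0 returns the required byte count.
    const int bytes = WideCharToMultiByte(CP_UTF8, 0, w.c_str(),
                                          static_cast<int>(w.size()),
                                          NULL, 0, NULL, NULL);
    if (bytes <= 0)
        return std::string();   // GetLastError() gives the reason

    std::string out(static_cast<std::string::size_type>(bytes), '\0');

    // Second call: perform the conversion. For CP_UTF8 the last two
    // parameters must stay NULL.
    if (WideCharToMultiByte(CP_UTF8, 0, w.c_str(),
                            static_cast<int>(w.size()),
                            &out[0], bytes, NULL, NULL) != bytes)
        return std::string();

    return out;
}
#endif
```

    Returning an empty string on failure is a simplification; a caller that must distinguish an empty input from an error would need a separate status value.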