C++ - Unicode Encoding Conversions with STL Strings and Win32 APIs

  • General discussion

  • Giovanni Dicanio presents C++ techniques for converting Unicode text between UTF-8 and UTF-16, using the Win32 APIs MultiByteToWideChar and WideCharToMultiByte. These Win32 C-interface APIs are wrapped in modern C++ code, using STL string classes to store Unicode text, and exceptions to signal error conditions.
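    A minimal sketch of the two-call pattern the article describes, assuming Windows headers are available. The function name Utf8ToUtf16 and the use of std::runtime_error are illustrative choices, not necessarily the article's own:

    ```cpp
    #include <windows.h>
    #include <stdexcept>
    #include <string>

    std::wstring Utf8ToUtf16(const std::string& utf8)
    {
        if (utf8.empty()) return std::wstring();

        // First call: query the required length, in wchar_t units.
        const int len = ::MultiByteToWideChar(
            CP_UTF8, MB_ERR_INVALID_CHARS,
            utf8.data(), static_cast<int>(utf8.size()),
            nullptr, 0);
        if (len == 0)
            throw std::runtime_error("Invalid UTF-8 sequence");

        std::wstring utf16(len, L'\0');

        // Second call: convert into the wstring's buffer.
        if (::MultiByteToWideChar(
                CP_UTF8, MB_ERR_INVALID_CHARS,
                utf8.data(), static_cast<int>(utf8.size()),
                &utf16[0], len) == 0)
            throw std::runtime_error("UTF-8 to UTF-16 conversion failed");

        return utf16;
    }
    ```

    The reverse direction follows the same two-call pattern with WideCharToMultiByte.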

    Read this article in the September issue of MSDN Magazine

    Thursday, September 1, 2016 5:26 PM
    Owner

All replies

  • I think std::u16string is better.

    Visual Basic beginner, please advise!

    Sunday, September 11, 2016 4:48 AM
  • Is there a reason to prefer the MultiByteToWideChar/WideCharToMultiByte route over std::codecvt_utf8 when the latter is available? Or even directly using the std::codecvt locale facet?
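
    For comparison, a sketch of the codecvt route the question refers to (note that the `<codecvt>` header is deprecated since C++17, though still widely available; the function name Utf8ToWide is my own):

    ```cpp
    #include <codecvt>
    #include <locale>
    #include <string>

    std::wstring Utf8ToWide(const std::string& utf8)
    {
        // On Windows wchar_t holds UTF-16 code units, so codecvt_utf8_utf16
        // mirrors MultiByteToWideChar(CP_UTF8, ...). Throws std::range_error
        // on invalid input.
        std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
        return conv.from_bytes(utf8);
    }
    ```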
    Wednesday, September 21, 2016 2:37 PM
  • There is the same need for C programs, but for them you can't rely on STL classes. The only two choices are either UTF-8 strings in char[] arrays, or UTF-16 strings in wchar_t[] arrays. UTF-8 is the natural choice in my opinion, and it's also been the choice of the GNU open source community.

    The big problem with the UTF-8 choice on Windows is that the Visual C++ standard C library does not support UTF-8 properly:

    • The program arguments, stdin, and stdout are encoded in the current console code page (CP 437 on US-English versions of Windows). You can change it to code page 65001 (UTF-8), but people often don't, and problems accumulate.
    • File I/O functions encode file names in the system code page, which is not the same (CP 1252 on US-English versions of Windows), and that one you can't change.
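
    The mismatch is easy to observe, assuming Windows headers are available; on US-English Windows the two values typically differ (1252 vs 437):

    ```cpp
    #include <windows.h>
    #include <cstdio>

    int main()
    {
        // System ("ANSI") code page, used by narrow-char file APIs.
        std::printf("System code page:  %u\n", ::GetACP());
        // Console output code page, used for stdout text.
        std::printf("Console code page: %u\n", ::GetConsoleOutputCP());
    }
    ```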

    After struggling with C programs portability and localisation issues for years, I ended up writing my own UTF-8 compatibility library layer over the Visual C++ standard C library. This library, called MsvcLibX, is available as open source at https://github.com/JFLarvoire/SysToolsLib. Main features:

    • C sources encoded in UTF-8, using normal char[] C strings, and standard C library APIs.
    • In any code page, everything is processed internally as UTF-8 in your code, with input and output in the right code page.
    • All stdio.h file functions support UTF-8 pathnames longer than 260 characters (up to 64 KB, in fact).
    • The same sources compile and link successfully both on Windows (using Visual C++, its C library, and MsvcLibX) and on Linux (using gcc and the standard C library), with no need for #ifdef ... #endif blocks.
    • Adds include files common in Linux, but missing in Visual C++. Ex: unistd.h
    • Adds missing functions, like those for directory I/O, symbolic link management, etc, all with UTF-8 support of course :-).

    More details in the MsvcLibX README on GitHub, including how to build the library and use it in your own programs.

    The release section of the above GitHub repository provides several programs built with MsvcLibX that demonstrate its capabilities. For example, try my which.exe tool with non-ASCII directory names in the PATH, searching for programs with non-ASCII names, and switching code pages.

    This MsvcLibX library is by no means complete, and contributions for improving it are welcome!

    Thursday, October 20, 2016 4:06 PM
  • Is there a reason to prefer the MultiByteToWideChar/WideCharToMultiByte route over std::codecvt_utf8 when the latter is available? Or even directly using the std::codecvt locale facet?

    Performance of direct Win32 API calls is much better than std::codecvt, as shown here.


    Tuesday, December 6, 2016 10:59 PM
  • When interacting with Windows APIs you are already in Windows platform-specific code, and std::wstring is fine. Note that wstring is based on wchar_t, like the Win32 Unicode APIs. By contrast, u16string is based on char16_t, which is a distinct type from wchar_t. See also the answers in this thread on StackOverflow.
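
    A quick illustration of that distinctness (the variable names are mine):

    ```cpp
    #include <string>
    #include <type_traits>

    // wchar_t and char16_t are always distinct types in C++,
    // even on Windows where both hold 16-bit UTF-16 code units.
    static_assert(!std::is_same<wchar_t, char16_t>::value,
                  "wchar_t and char16_t are distinct types");

    // std::wstring ws_bad = u"text";  // would not compile: char16_t* -> wstring
    std::wstring ws = L"text";         // wchar_t-based, what Win32 W APIs take
    std::u16string us = u"text";       // char16_t-based, guaranteed UTF-16
    ```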
    Tuesday, December 6, 2016 11:16 PM