What is the point of wchar_t?

    Question

  • Seriously, from what I've read all that wchar_t does is promise a larger character set. That's it. There are no size guarantees (then again, none of the types in C/C++ have any other than a minimum), no functionality guarantees, and it doesn't actually provide a standardized 'Unicode' (UTF-16 in MSVC's case) solution.

    What we end up with is a bigger brother to char which doesn't do much except hold more characters. The encoding isn't consistent for Unicode use, and neither is the endianness (then again, that's CPU/hardware dependent) nor the size (other than a minimum).

    If I'm making a cross-platform app that has to deal with quite a few OSes (Mac, Linux, WinXX), this isn't nice. At all. To top it all off, apparently Linux/OS X use a UTF-32 wchar_t instead of UTF-16, but that's an encoding issue.

    Perhaps I'm missing a piece of the puzzle and throwing my conclusion all off, but wchar_t seems to be a joke. One might as well implement their own UTF-8 or UTF-16 handling routines rather than use the wchar headers, which seem to be totally missing the point of having a std lib. Hopefully I'm wrong on something here...
    Monday, July 13, 2009 9:01 PM

All replies

  • One might as well implement their own UTF-8 or UTF-16 handling routines rather than use the wchar headers, which seem to be totally missing the point of having a std lib. Hopefully I'm wrong on something here...
    Imagine trying to write the C++ runtime library in a manner that supports UTF-8. In fact, try implementing:

    char str[] = Ĥellö, world".
    char *t = str;
    t++;

    This is much more complicated than supporting UTF-16 with the help of wchar_t. Moreover, there is a lot of legacy code out there that will measure the length of a string by doing something like

    char *EndOfString, *BeginningOfString;

    len = (EndOfString - BeginningOfString) / sizeof (BeginningOfString[0]);
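
    As a rough sketch of the iteration problem (next_code_point here is just a hypothetical helper, and the byte escapes spell out "Ĥellö" in UTF-8), this is what a plain t++ does versus what stepping over a whole code point would take:

    #include <iostream>

    // Advance past one UTF-8 code point by skipping the continuation
    // bytes (those of the form 10xxxxxx). Assumes well-formed UTF-8.
    const char* next_code_point(const char* p)
    {
        ++p;                          // consume the lead byte
        while ((*p & 0xC0) == 0x80)   // 0x80..0xBF are continuation bytes
            ++p;
        return p;
    }

    int main()
    {
        const char str[] = "\xC4\xA4" "ell" "\xC3\xB6" ", world";  // "Ĥellö, world"
        const char* t = str;

        t++;                       // one byte: now in the middle of the 2-byte 'Ĥ'
        t = next_code_point(str);  // one code point: now on the 'e'

        std::cout << t << '\n';    // prints "ellö, world" on a UTF-8 console
    }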


    • Marked as answer by Olivier Hamel Monday, July 13, 2009 9:43 PM
    • Edited by Brian Muth Monday, July 13, 2009 9:57 PM code fix
    Monday, July 13, 2009 9:13 PM
  • The C/C++ standard doesn't even guarantee the size of int, why would you expect wchar_t to be different?  Register your complaint with your nearest committee member.  For cross-platform development, using a language that runs on a virtual machine that can provide type size guarantees is much less painful.  Ruby, Python, Java, Silverlight off the top of my head.

    Hans Passant.
    • Marked as answer by Olivier Hamel Monday, July 13, 2009 9:44 PM
    Monday, July 13, 2009 9:29 PM
    Moderator
  • Good points. (@nobugz, I am using one, Lua, but not for low-level stuff.) Ugh, I guess this is what they call 'experience'. I'll just jump in blindly with wchar_t, see what happens, and adapt if it causes problems. I still have a nagging feeling that this isn't "right", but I'll let it go for now. Thank you again for your time and advice.
    Monday, July 13, 2009 9:43 PM
  • Unicode does not predate C, so a better question would be why there are so many standards within Unicode.

    I believe integers have, by convention at least, always been the size of the bus, or a register. (Observation only)

    OpenType font files are stored in big-endian (Motorola) byte order... how much sense does that make on my PC? (Then again, I am not supposed to be opening them myself, either.)

    Brian makes a good point: diacriticals, contextual drawing of glyphs, and also (I assume) RTL text are outside of the encoding proper, yet they are still a concern for code that must try to handle so many wildly different languages.

    I get frustrated too, but it seems to me that MS is actually doing a pretty good job at pleasing as many cultures as they can.

    Monday, July 13, 2009 10:08 PM
  • Olivier Hamel wrote:
    > Seriously, from what I've read all that wchar_t does is promise a
    > larger character set. That's it. There are no size guarantees (then
    > again, none of the types in C/C++ have any other than a minimum), no
    > functionality guarantees, and it doesn't actually provide a
    > standardized 'Unicode' (UTF-16 in MSVC's case) solution.

    You might be happy to learn that the next version of the C++ standard
    (C++0x) will include char16_t and char32_t as native types. It will
    support UTF-8, UTF-16 and UTF-32 string literals (u8"string",
    u"string", U"string").

    There's not much guaranteed in the way of Unicode support in the
    library, except for classes codecvt_utf8, codecvt_utf16 and
    codecvt_utf8_utf16 which can convert between various Unicode encodings
    (as well as native wide character encoding as represented by wchar_t,
    whatever that might be). You'd still need third-party libraries for
    things like, say, locale-sensitive collation.
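
    As a minimal sketch of those literals (assuming a compiler running in C++0x/C++11 mode; there u8"..." yields plain char, and char8_t only arrived in C++20):

    #include <iostream>

    int main()
    {
        // \u0124 is 'Ĥ', \u00F6 is 'ö'
        const char     u8s[]  = u8"\u0124ell\u00F6";  // UTF-8 bytes in a char array
        const char16_t u16s[] = u"\u0124ell\u00F6";   // UTF-16 code units
        const char32_t u32s[] = U"\u0124ell\u00F6";   // one element per code point

        // Element counts (including the terminating 0) differ per encoding:
        std::cout << sizeof u8s  / sizeof u8s[0]  << '\n'   // 8
                  << sizeof u16s / sizeof u16s[0] << '\n'   // 6
                  << sizeof u32s / sizeof u32s[0] << '\n';  // 6
    }
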
    --
    Igor Tandetnik


    Tuesday, July 14, 2009 2:06 AM
  • Brian Muth wrote:
    > Imagine trying to write the C++ runtime library in a manner that
    > supports UTF-8. In fact, try implementing:
    >
    > char str[] = Hello, world".
    > char *t = str;
    > t++;
    >
    > This is much more complicated than supporting UTF-16 with the help
    > of wchar_t. Moreover, there is a lot of legacy code out there that
    > will measure the length of a string by doing something like
    >
    > char *EndOfString, *BeginningOfString;
    >
    > len = (EndOfString - BeginningOfString) / sizeof
    > (BeginningOfString[0]);

    If this is wrong, what would be the correct way to determine the length
    of a UTF-8 string? If your answer is "the length in Unicode codepoints",
    what precisely is this number useful for? You can't allocate memory
    based on it, and it doesn't necessarily correspond to the length in
    "characters" (aka glyphs aka graphemes) the user would recognize.

    In my humble opinion, plain old strlen works just fine on UTF-8 strings,
    returning the length of the string I can use for memory allocation.
    Other useful meanings of "length" (say, width in pixels when the string
    is rendered in a given font) are best left to specialized libraries.
    The same is true for UTF-16: if, say, wcslen were trying to adjust for
    surrogate pairs, it would be doing me a disservice.
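
    A small sketch of that distinction (utf8_code_points is a hypothetical helper written only for this example; the byte escapes spell out "Ĥellö" in UTF-8):

    #include <cstring>
    #include <iostream>

    // Count code points by counting every byte that is not a
    // continuation byte (continuation bytes look like 10xxxxxx).
    std::size_t utf8_code_points(const char* s)
    {
        std::size_t n = 0;
        for (; *s; ++s)
            if ((*s & 0xC0) != 0x80)
                ++n;
        return n;
    }

    int main()
    {
        const char* s = "\xC4\xA4" "ell" "\xC3\xB6";   // "Ĥellö"

        std::cout << std::strlen(s) << '\n'            // 7 bytes: what you allocate/copy by
                  << utf8_code_points(s) << '\n';      // 5 code points: rarely what you need
    }
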
    --
    Igor Tandetnik


    Tuesday, July 14, 2009 2:37 AM