What is the point of wchar_t?

Question
-
Seriously, from what I've read, all that wchar_t does is promise a larger character set. That's it. There are no size guarantees (then again, none of the types in C/C++ have anything other than a minimum), no functionality guarantees, and it doesn't actually provide a standardized Unicode (UTF-16 in MSVC's case) solution.
What we end up with is a bigger brother to char which doesn't do much except hold more characters. The encoding isn't consistent for Unicode use, and neither is the endianness (then again, that's CPU/hardware dependent) nor the size (other than a minimum).
If I'm making a cross-platform app that has to deal with quite a few OSes (Mac, Linux, WinXX), this isn't nice. At all. To top it all off, apparently Linux/OSX use UTF-32 instead of UTF-16, but that's an encoding issue.
Perhaps I'm missing a piece of the puzzle and throwing my conclusion all off, but wchar_t seems to be a joke. One might as well implement their own UTF-8 or UTF-16 handling routines rather than use the wchar headers, which seem to be totally missing the point of having a std lib. Hopefully I'm wrong on something here...
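For reference, a minimal check of the size difference being described here (the typical values in the comment are platform assumptions, not guarantees):

#include <cstdio>

int main()
{
    // Commonly 2 bytes (UTF-16 code units) with MSVC and 4 bytes (UTF-32)
    // with gcc on Linux/OSX; the C/C++ standards do not fix either size.
    std::printf("sizeof(wchar_t) = %u\n", static_cast<unsigned>(sizeof(wchar_t)));
    std::printf("sizeof(int)     = %u\n", static_cast<unsigned>(sizeof(int)));
    return 0;
}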
Answers
-
One might as well implement their own UTF-8 or UTF-16 handling routines rather than use the wchar headers, which seem to be totally missing the point of having a std lib. Hopefully I'm wrong on something here...
Imagine trying to write the C++ runtime library in a manner that supports UTF-8. In fact, try implementing:
char str[] = "Ĥellö, world";
char *t = str;
t++;
This is much more complicated than supporting UTF-16 with the help of wchar_t. Moreover, there is a lot of legacy code out there that will measure the length of a string by doing something like
char *EndOfString, *BeginningOfString;
len = (EndOfString - BeginningOfString) / sizeof (BeginningOfString[0]);
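To make the point concrete, here is a minimal sketch (NextCodePoint is a hypothetical helper and assumes well-formed UTF-8) of why t++ steps one byte rather than one character once multi-byte UTF-8 sequences are involved:

#include <cstdio>

// Hypothetical helper: advance past one UTF-8 code point by skipping the
// lead byte and any continuation bytes of the form 10xxxxxx.
const char* NextCodePoint(const char* p)
{
    ++p;
    while ((static_cast<unsigned char>(*p) & 0xC0) == 0x80)
        ++p;
    return p;
}

int main()
{
    const char str[] = "\xC4\xA4" "ell\xC3\xB6, world";  // "Ĥellö, world" encoded as UTF-8
    const char* t = str;

    t++;                     // points at the second byte of 'Ĥ', not at 'e'
    t = NextCodePoint(str);  // points at 'e', as intended

    std::printf("byte after t++: 0x%02X, code point after NextCodePoint: %c\n",
                static_cast<unsigned>(static_cast<unsigned char>(str[1])), *t);
    return 0;
}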
Marked as answer by Olivier Hamel, Monday, July 13, 2009 9:43 PM
Edited by Brian Muth, Monday, July 13, 2009 9:57 PM (code fix)
-
The C/C++ standard doesn't even guarantee the size of int, so why would you expect wchar_t to be different? Register your complaint with your nearest committee member. For cross-platform development, using a language that runs on a virtual machine that can provide type size guarantees is much less painful: Ruby, Python, Java, Silverlight, off the top of my head.
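If code really does depend on particular sizes, the assumption can at least be made explicit; a minimal sketch (static_assert and <cstdint> are C++0x/C++11 features, assumed to be available here):

#include <cstdint>
#include <climits>

static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");

// A module that requires a 16-bit wchar_t (as on Windows) can say so loudly
// instead of relying on it silently:
// static_assert(sizeof(wchar_t) == 2, "expects a UTF-16-sized wchar_t");

// Or sidestep wchar_t entirely and store UTF-16 code units in an
// exact-width type:
typedef std::uint16_t utf16_unit;

int main() { return 0; }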
Hans Passant.
Marked as answer by Olivier Hamel, Monday, July 13, 2009 9:44 PM
All replies
-
Good points. (@nobugz: I am using one, Lua, but not for low-level stuff.) Ugh, I guess this is what they call 'experience'. I'll just jump in blindly with wchar_t, see what happens, and adapt if it causes problems. I still have a nagging feeling that this isn't "right", but I'll let it go for now. Thank you again for your time and advice.
-
Unicode does not predate C, so a better question would be why there are so many standards within Unicode.
I believe integers have, by convention at least, always been the size of the bus or a register. (Observation only.)
OpenType font files are stored in Motorola (big-endian) format... how much sense does that make on my PC? (Then again, I am not supposed to be opening them myself, either.)
Brian makes a good point: diacriticals, contextual drawing of glyphs, and also (I assume) RTL are things that sit outside of an encoding proper, yet they are still a concern for code that must try to handle so many wildly different languages.
I get frustrated too, but it seems to me that MS is actually doing a pretty good job of pleasing as many cultures as they can.
-
Olivier Hamel wrote:
> Seriously, from what I've read, all that wchar_t does is promise a
> larger character set. That's it. There are no size guarantees (then
> again, none of the types in C/C++ have anything other than a minimum),
> no functionality guarantees, and it doesn't actually provide a
> standardized Unicode (UTF-16 in MSVC's case) solution.
You might be happy to learn that the next version of the C++ standard
(C++0x) will include char16_t and char32_t as native types. It will
support UTF-8, UTF-16 and UTF-32 string literals (u8"string",
u"string", U"string").
There's not much guaranteed in the way of Unicode support in the
library, except for classes codecvt_utf8, codecvt_utf16 and
codecvt_utf8_utf16 which can convert between various Unicode encodings
(as well as native wide character encoding as represented by wchar_t,
whatever that might be). You'd still need third-party libraries for
things like, say, locale-sensitive collation.
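A minimal sketch of those features as they ended up in C++11 (note that <codecvt> was later deprecated in C++17, so this is illustrative only; the source file is assumed to be saved as UTF-8):

#include <string>
#include <locale>
#include <codecvt>

int main()
{
    const char*    u8str  = u8"Ĥellö, world";   // UTF-8 literal (const char[] in C++11)
    std::u16string u16str = u"Ĥellö, world";    // UTF-16, char16_t
    std::u32string u32str = U"Ĥellö, world";    // UTF-32, char32_t
    (void)u32str;                                // silence unused-variable warning

    // Convert between UTF-8 and UTF-16 with the codecvt_utf8_utf16 facet.
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    std::u16string fromUtf8   = conv.from_bytes(u8str);
    std::string    backToUtf8 = conv.to_bytes(u16str);

    return (fromUtf8 == u16str && backToUtf8 == u8str) ? 0 : 1;
}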
--
Igor Tandetnik
-
Brian Muth wrote:
> Imagine trying to write the C++ runtime library in a manner that
> supports UTF-8. In fact, try implementing:
>
> char str[] = "Hello, world";
> char *t = str;
> t++;
>
> This is much more complicated than supporting UTF-16 with the help
> of wchar_t. Moreover, there is a lot of legacy code out there that
> will measure the length of a string by doing something like
>
> char *EndOfString, *BeginningOfString;
>
> len = (EndOfString - BeginningOfString) / sizeof
> (BeginningOfString[0]);
If this is wrong, what would be the correct way to determine the length
of a UTF-8 string? If your answer is "the length in Unicode codepoints",
what precisely is this number useful for? You can't allocate memory
based on it, and it doesn't necessarily correspond to the length in
"characters" (aka glyphs aka graphemes) the user would recognize.
In my humble opinion, plain old strlen works just fine on UTF-8 strings,
returning the length of the string I can use for memory allocation.
Other useful meanings of "length" (say, width in pixels when the string
is rendered in a given font) are best left to specialized libraries.
The same is true for UTF-16: if, say, wcslen were trying to adjust for
surrogate pairs, it would be doing me a disservice.
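A minimal sketch of the distinction (CodePointCount is a hypothetical helper that assumes well-formed UTF-8, and the source file is assumed to be UTF-8-encoded):

#include <cstdio>
#include <cstring>

// Hypothetical helper: count code points by counting bytes that are not
// UTF-8 continuation bytes (10xxxxxx).
std::size_t CodePointCount(const char* s)
{
    std::size_t n = 0;
    for (; *s; ++s)
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80)
            ++n;
    return n;
}

int main()
{
    const char utf8[] = "Ĥellö, world";

    // strlen gives the byte length: exactly what allocation and copying need.
    std::printf("strlen      = %u bytes\n",
                static_cast<unsigned>(std::strlen(utf8)));
    // The code point count is a different number, and not a glyph count either.
    std::printf("code points = %u\n",
                static_cast<unsigned>(CodePointCount(utf8)));
    return 0;
}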
--
Igor Tandetnik