Unicode newline problem
-
Saturday, August 15, 2009 9:49 AM
I have been using _snwprintf, and Notepad interprets my files as ANSI instead of Unicode, which they are supposed to be. I looked at the bytes and saw that my \r\n was written as 0x000D (CR, OK) followed by 0x0D0A (ANSI CR+LF!?), which causes the misdetection. I then tried to manually insert a Unicode LF into my file buffer instead of doing so in _snwprintf: same result. It seems that the compiler does this.
As far as I know, the correct Unicode line ending on the Windows platform is \r\n, that is 0x000D 0x000A, and the ANSI one is 0x0D 0x0A.
If this is not so, then there is a bug in Notepad, WordPad, and the MSVS08 text editor (all of these treat the file as ANSI).
I'm compiling using msvs08 sp1
-Greg- Edited by G386 Saturday, August 15, 2009 9:55 AM
All Replies
-
Saturday, August 15, 2009 9:59 AM
Expanding a \n to \r\n is automatic for files opened in text mode.
To get just a \n the file must be opened in binary mode.
- Wayne- Marked As Answer by G386 Saturday, August 15, 2009 11:29 AM
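Editor's sketch of the expansion Wayne describes. It does not call the CRT; `to_utf16le` and `simulate_text_mode` are hypothetical helpers that only imitate what a byte-oriented text-mode stream does on Windows, namely writing every 0x0A byte as 0x0D 0x0A without knowing the data is UTF-16. `char16_t` is used instead of `wchar_t` so the example behaves the same on platforms where `wchar_t` is not 16 bits.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Serialize UTF-16 code units as little-endian bytes (low byte first).
std::vector<std::uint8_t> to_utf16le(const std::u16string& s) {
    std::vector<std::uint8_t> out;
    for (char16_t c : s) {
        out.push_back(static_cast<std::uint8_t>(c & 0xFF));
        out.push_back(static_cast<std::uint8_t>((c >> 8) & 0xFF));
    }
    return out;
}

// Hypothetical stand-in for text-mode output on Windows:
// each 0x0A byte gains a 0x0D in front of it, regardless of what the bytes mean.
std::vector<std::uint8_t> simulate_text_mode(const std::vector<std::uint8_t>& in) {
    std::vector<std::uint8_t> out;
    for (std::uint8_t b : in) {
        if (b == 0x0A) out.push_back(0x0D);
        out.push_back(b);
    }
    assert(out.size() >= in.size());  // translation can only grow the data
    return out;
}
```

Feeding the UTF-16LE bytes of "\r\n" (0D 00 0A 00) through `simulate_text_mode` yields 0D 00 0D 0A 00. That odd-length sequence appears consistent with the 0x000D followed by 0x0D0A reported in the question, and every 16-bit unit after it is shifted by one byte.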
-
Saturday, August 15, 2009 10:02 AM
Doesn't matter. Even if I only write a \n, Notepad still interprets the file as ANSI (in binary mode the char written is not a Unicode \n but an ANSI \r\n, i.e. the expanded \n).
-Greg
-
Saturday, August 15, 2009 10:03 AM Moderator
_snwprintf() writes to a string, not a file. I'd guess it goes wrong when you then write that string to a file. Post your code.
Also, a properly encoded UTF-16 file should have a BOM, FF FE for little-endian, in the first 2 byte positions.
Hans Passant.
-
Saturday, August 15, 2009 10:04 AM
int _valog(wchar_t* str, va_list* va){
    if(!inited)
        init();//init with default values
    int size, ret;
    wchar_t* tmp = new wchar_t[1024];
    tmp[0]=L'\0';
    ret = _vsnwprintf(tmp, 1022, str,(*va) );
    if(ret==-1)
        for(size=0; tmp[size]!=L'\0'&&size!=1024; size++);
    else
        size = ret;
    tmp[size++] = L'\n';
    ret = fwrite(tmp, sizeof(wchar_t), size, file);
    fflush(file);
    delete [] tmp;
    return (ret==size)?size:-1;
}
Also, I've changed init to add the little-endian BOM at the beginning: same error. I've checked the contents of L'\n' at runtime and it is fine at that point.
the file: http://dc104.2shared.com/download/7190680/e45746c7/log.txt?tsid=20090815-063001-972052a7 //direct link, hope it works, if not try next
http://www.2shared.com/file/7190680/e45746c7/log.html
-Greg -
Saturday, August 15, 2009 11:19 AM
The file does not seem to be opened in binary mode (with the “b” flag instead of “t”). Check again, keep the BOM, and also execute this sequence at the end of each line: tmp[size++] = L'\r'; tmp[size++] = L'\n'; tmp[size++] = 0;
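Editor's sketch of the three ingredients named in this reply: binary mode, a BOM, and an explicit \r\n. It is a minimal illustration under stated assumptions, not the poster's code. The function name `append_utf16le_line` is hypothetical, and `char16_t` with explicit byte serialization is an assumption made so the output is UTF-16LE on any host, whereas the thread's code relies on `wchar_t` being 16 bits on Windows.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Append one line to `path` as UTF-16LE.
// Writes the FF FE byte order mark first when the file is still empty.
bool append_utf16le_line(const char* path, const std::u16string& text) {
    assert(path != nullptr);
    std::FILE* f = std::fopen(path, "ab");  // "b": no newline translation
    if (!f) return false;

    std::fseek(f, 0, SEEK_END);             // make ftell report the file size
    const bool empty = (std::ftell(f) == 0);

    std::string bytes;
    if (empty) bytes += "\xFF\xFE";         // little-endian BOM
    const std::u16string line = text + u"\r\n";  // explicit CR LF
    for (char16_t c : line) {
        bytes.push_back(static_cast<char>(c & 0xFF));         // low byte
        bytes.push_back(static_cast<char>((c >> 8) & 0xFF));  // high byte
    }

    const bool ok = std::fwrite(bytes.data(), 1, bytes.size(), f) == bytes.size();
    return std::fclose(f) == 0 && ok;
}
```

Calling `append_utf16le_line("log.txt", u"hello")` twice on a fresh file should leave one BOM followed by two CR LF terminated lines. No terminating zero is written to the file.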
-
Saturday, August 15, 2009 11:23 AM
Worked like a charm, though the string termination is not required, only binary mode. Thanks.
-Greg
-
Saturday, August 15, 2009 11:31 AM
I see nothing in your code snippet here which tells me how you opened the file.
Did you use something like this?
FILE * fp;
fp = fopen("utest.dat", "w, ccs=UNICODE");
Or this?
fp = fopen("utest.dat", "w, ccs=UTF-16LE");
Other?
- Wayne
-
Saturday, August 15, 2009 2:02 PM
bool init(){
    if(file)
        fclose(file);
    //try to open for reading
    file = fopen("log.txt", "r");
    if(file){
        //file exists: reopen for binary append
        fclose(file);
        file = fopen("log.txt", "ab");
    }else{
        //new file: create it in binary mode and write the UTF-16LE BOM
        const unsigned char utf16[2] = {0xFF, 0xFE};
        file = fopen("log.txt", "wb");
        fwrite(utf16, 1, 2, file);
    }
    inited = true;
    return true;
};
I don't deal with text files a lot, so I just went for a quick solution.
-Greg
-
Saturday, August 15, 2009 4:34 PM
Well, the quick solution is the wrong solution.
Quote>I don't deal with text files a lot, so I just went for a quick solution.
You are confusing two concepts. Many of the C Runtime calls come in pairs (fopen and _wfopen, fprintf and fwprintf, etc.), depending on whether the parameters accept char* or wchar_t* data types. This has nothing to do with the encoding of the file itself. For example, fwprintf() will print ANSI strings to a file that has been opened with ANSI encoding (the default), even though its parameters are specified as wide character strings. In your example of fopen("log.txt", "r"), the encoding is ANSI.
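Editor's sketch of the point above, that a wide-character API does not by itself produce a wide-character file. The helper name `bytes_written_by_fwprintf` and the file name in the usage note are hypothetical. The example assumes an ASCII-only string so that the conversion to the stream's narrow encoding is the same in any locale.

```cpp
#include <cassert>
#include <cstdio>
#include <cwchar>
#include <string>

// Write a wide string with fwprintf to a stream opened in the default mode
// (no ccs= flag), then read the file back and return the raw bytes on disk.
std::string bytes_written_by_fwprintf(const char* path, const wchar_t* text) {
    assert(path != nullptr && text != nullptr);
    std::FILE* f = std::fopen(path, "w");   // default (narrow) encoding
    if (!f) return std::string();
    std::fwprintf(f, L"%ls", text);         // wide arguments, narrow file
    std::fclose(f);

    std::string bytes;
    f = std::fopen(path, "rb");             // read back without translation
    if (!f) return bytes;
    int c;
    while ((c = std::fgetc(f)) != EOF) bytes.push_back(static_cast<char>(c));
    std::fclose(f);
    return bytes;
}
```

For `bytes_written_by_fwprintf("wide_api_demo.txt", L"abc")` the file should hold three bytes, one per character, rather than six: the wide characters are converted on the way out.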
For newly created files, you can specify the encoding. Wayne already pointed you in this direction. You can indicate which encoding you want by specifying either ccs=ANSI, ccs=UNICODE, ccs=UTF-8, or ccs=UTF_16LE in the fopen() call. Read more about it here. I find the documentation quite straightforward, but if you have questions, feel free to post them.
-
Saturday, August 15, 2009 6:32 PM
Quote>if you have questions, feel free to post them.
OK Brian, I'll call you on that. ;-)
(1) The table in the help suggests that if UNICODE is used for a new file it will be written as ANSI.
I'm not seeing that result. The file is in fact being created as UTF-16LE whether it preexisted or not.
(2) You said:
Quote>fwprintf() will print ANSI strings to a file that has been opened with ANSI encoding
Quote>(the default), even though its parameters are specified as wide character strings.
In my experience, if you try to fwprintf a wchar_t string to a file opened as ANSI (no ccs
specified) you will get a runtime assert in debug mode or a fatal error in release mode.
- Wayne
P.S. - A small typo in your post: UTF_16LE should be UTF-16LE
-
Saturday, August 15, 2009 7:07 PM
Follow-up: I can't seem to reproduce the symptoms I described in (2) above.
In my latest tests it does indeed seem to behave as you described.
Now I have to waste more grey cells trying to repro my earlier results,
which were definite and consistent previously.
- Wayne
-
Saturday, August 15, 2009 7:26 PM
Quote>(1) The table in the help suggests that if UNICODE is used for a new file it will be written as ANSI.
Wayne, I don't see that anywhere. Can you provide a quote I can search for?
Quote>P.S. - A small typo in your post: UTF_16LE should be UTF-16LE
Yup. Thanks.
-
Saturday, August 15, 2009 7:26 PM
Follow-up to the Follow-up:
The debug assertion may have occurred under the following conditions:
file exists: no
fopen: ccs=UNICODE
fprintf (not fwprintf) of char (not wchar_t) string compiles clean but asserts in debug mode
- Wayne
-
Saturday, August 15, 2009 7:37 PM
Quote>I don't see that anywhere. Can you provide a quote I can search for?
At the link you provided, scroll down to the table "Encodings Used Based on Flag and BOM".
It says for Flag UNICODE and No BOM (or new file) encoding is ANSI.
- Wayne
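Editor's sketch of the BOM check that such a table depends on. This is not the CRT's implementation, and `Bom` and `sniff_bom` are hypothetical names. It only classifies the leading bytes of a buffer by the well-known byte order marks, and it deliberately ignores UTF-32, whose little-endian BOM also starts with FF FE.

```cpp
#include <cassert>
#include <cstddef>

enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Classify the first bytes of a file by byte order mark.
// Returns Bom::None when no known mark is present (or the buffer is too short).
Bom sniff_bom(const unsigned char* data, std::size_t len) {
    assert(data != nullptr || len == 0);
    if (len >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return Bom::Utf8;      // EF BB BF
    if (len >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return Bom::Utf16LE;   // FF FE
    if (len >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return Bom::Utf16BE;   // FE FF
    return Bom::None;
}
```

A new or empty file has no bytes to inspect, so it falls into the "No BOM" case, which is the row of the table under discussion here.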
-
Saturday, August 15, 2009 9:30 PM
Brian -
One further question I meant to toss at you in my earlier post. You said:
Quote>You can indicate which encoding you want by specifying either ccs=ANSI, ...
I don't believe ccs=ANSI is acceptable. If that's what one wants, then they just omit the
ccs argument altogether. YMMV
- Wayne
-
Saturday, August 15, 2009 10:23 PM
Quote>Quote>I don't see that anywhere. Can you provide a quote I can search for?
Quote>At the link you provided, scroll down to the table "Encodings Used Based on Flag and BOM".
Quote>It says for Flag UNICODE and No BOM (or new file) encoding is ANSI.
Quote>- Wayne
My interpretation of that clause is that it applies to existing files being opened, as opposed to a new file. Indeed, after reading the paragraph several times, I find the documentation confusing, if not outright wrong. However, when the flag is UNICODE and the file is new, a BOM is created and the encoding is UTF-16, which is what one would expect.
-
Saturday, August 15, 2009 10:40 PM
Quote>Follow-up to the Follow-up:
Quote>The debug assertion may have occurred under the following conditions:
Quote>file exists: no
Quote>fopen: ccs=UNICODE
Quote>fprintf (not fwprintf) of char (not wchar_t) string compiles clean but asserts in debug mode
Quote>- Wayne
This is very interesting, and I've confirmed this. The same behaviour is seen in VS2010 Beta 1. This doesn't appear to be documented anywhere. I'd venture to say that this is a bona fide bug. I can't see why there would be this restriction. Why not submit this to Microsoft Connect and see what they say, Wayne?
Nice catch!
Quote>I don't believe ccs=ANSI is acceptable. If that's what one wants, then they just omit the
Quote>ccs argument altogether. YMMV
Yes, you are correct.
-
Sunday, August 16, 2009 7:35 AM
Is this an MS standard? I haven't seen it in any other documentation.
-Greg

