unicode newline problem

    Question

  • I have been using _snwprintf, and Notepad interpreted my files as ANSI instead of Unicode, which they were supposed to be. I looked at the bytes and saw that my \r\n was written as 0x000D (CR, ok) and 0x0D0A (ANSI CR+LF!?), which caused the mistake. I then tried to manually insert a Unicode LF into my file buffer instead of doing so in _snwprintf: same result. It seems that the compiler does this.

    As far as I know, the correct Unicode line ending on the Windows platform is \r\n, that is 0x000D000A, and the ANSI one is 0x0D0A.

    If this is not so, then there is a bug in Notepad, WordPad and the msvs08 text editor (all of these treat the file as ANSI).


    I'm compiling using msvs08 sp1

    -Greg
    • Edited by G386 Saturday, August 15, 2009 9:55 AM
    Saturday, August 15, 2009 9:49 AM

Answers

  • Expanding a \n to \r\n is automatic for files opened in text mode.
    To get just a \n the file must be opened in binary mode.

    - Wayne
    • Marked as answer by G386 Saturday, August 15, 2009 11:29 AM
    Saturday, August 15, 2009 9:59 AM
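
A minimal sketch of the effect Wayne describes, assuming the text-mode translation works on individual bytes rather than on wide characters. The helper `simulate_text_mode` is hypothetical and only imitates the translation on an in-memory buffer; it is not a CRT function.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: imitates what text-mode output on Windows does to a
// byte stream. Every 0x0A byte gets a 0x0D byte inserted in front of it,
// regardless of whether that byte is half of a UTF-16 code unit.
std::vector<std::uint8_t> simulate_text_mode(const std::vector<std::uint8_t>& in) {
    std::vector<std::uint8_t> out;
    out.reserve(in.size());
    for (std::uint8_t b : in) {
        if (b == 0x0A) {
            out.push_back(0x0D);  // the inserted carriage return
        }
        out.push_back(b);
    }
    return out;
}
```

Feeding it the UTF-16LE bytes of \r\n (0D 00 0A 00) yields 0D 00 0D 0A 00. The extra byte shifts everything after it by one position, which matches the 0x000D followed by 0x0D0A pattern reported in the question.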

All replies

  • Doesn't matter: even if I only write a \n, Notepad still interprets the file as ANSI (in binary mode the char written is not a Unicode \n but an ANSI \r\n, i.e. the expanded \n)
    -Greg
    Saturday, August 15, 2009 10:02 AM
  • _snwprintf() writes to a string, not a file.  I'd guess it goes wrong when you then write that string to a file.  Post your code. 

    Also, a properly encoded UTF16 file should have a BOM, FF FE in the first 2 byte positions.

    Hans Passant.
    Saturday, August 15, 2009 10:03 AM
    Moderator
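
To illustrate the BOM remark, here is a rough sketch of how a reader could classify a file by its first bytes. `detect_bom` is a hypothetical helper, not the logic Notepad or the CRT actually uses, and it deliberately ignores UTF-32 (whose little-endian BOM also starts with FF FE).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: names the encoding implied by a byte order mark at
// the start of a buffer, or "none" if no known mark is present.
std::string detect_bom(const unsigned char* data, std::size_t len) {
    if (len >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF) {
        return "UTF-8";
    }
    if (len >= 2 && data[0] == 0xFF && data[1] == 0xFE) {
        return "UTF-16LE";  // the FF FE mark mentioned above
    }
    if (len >= 2 && data[0] == 0xFE && data[1] == 0xFF) {
        return "UTF-16BE";
    }
    return "none";
}
```

A file that starts with FF FE is reported as UTF-16LE; a file without any mark falls through to "none", in which case an editor has to guess the encoding.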
  • int _valog(wchar_t* str, va_list* va){
        if(!inited)
            init();//init with default values
        int size, ret;
        wchar_t* tmp    = new wchar_t[1024];
        tmp[0]=L'\0';
        ret            = _vsnwprintf(tmp, 1022, str,(*va) );
        if(ret==-1)
            for(size=0; tmp[size]!=L'\0'&&size!=1024; size++);
        else
            size    = ret;
        tmp[size++]    = L'\n';
        ret            = fwrite(tmp, sizeof(wchar_t), size, file);
        fflush(file);
        delete [] tmp;
        return (ret==size)?size:-1;
    }

    Also, I've changed init to add the little-endian BOM at the beginning; same error. I've checked the contents of L'\n' at runtime and it's fine then.

    the file: http://dc104.2shared.com/download/7190680/e45746c7/log.txt?tsid=20090815-063001-972052a7  //direct link, hope it works, if not try next
    http://www.2shared.com/file/7190680/e45746c7/log.html
    -Greg
    Saturday, August 15, 2009 10:04 AM
  • The file does not seem to be opened in binary mode (with “b” flag instead of “t”). Check again, keep the BOM, and also execute this sequence: tmp[size++] = L'\r'; tmp[size++] = L'\n'; tmp[size++] = 0; at the end of each line.
    • Marked as answer by G386 Saturday, August 15, 2009 11:28 AM
    • Unmarked as answer by G386 Saturday, August 15, 2009 2:00 PM
    Saturday, August 15, 2009 11:19 AM
  • Worked like a charm, though the string termination is not required, only binary mode... thanks

    -Greg
    • Edited by G386 Saturday, August 15, 2009 11:28 AM
    Saturday, August 15, 2009 11:23 AM
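
The fix that worked (binary mode, a BOM, and an explicit \r\n) can be sketched portably as follows. This is an illustration under assumptions, not Greg's actual code: `to_utf16le_bytes` and `append_log_line` are hypothetical names, and the sketch uses `char16_t` with explicit byte serialisation so that it does not depend on `wchar_t` being 16 bits wide.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: serialises UTF-16 code units as little-endian bytes.
std::vector<unsigned char> to_utf16le_bytes(const std::u16string& text) {
    std::vector<unsigned char> bytes;
    bytes.reserve(text.size() * 2);
    for (char16_t unit : text) {
        bytes.push_back(static_cast<unsigned char>(unit & 0xFF));  // low byte first
        bytes.push_back(static_cast<unsigned char>(unit >> 8));
    }
    return bytes;
}

// Hypothetical helper: appends one line to a UTF-16LE log file. The BOM is
// written only when the file is still empty.
bool append_log_line(const char* path, const std::u16string& line) {
    std::FILE* f = std::fopen(path, "ab");  // "b": no \n to \r\n translation
    if (!f) {
        return false;
    }
    std::fseek(f, 0, SEEK_END);
    std::u16string out;
    if (std::ftell(f) == 0) {
        out += static_cast<char16_t>(0xFEFF);  // serialised as FF FE
    }
    out += line;
    out += u"\r\n";  // explicit CR LF, each a full 16-bit code unit
    const std::vector<unsigned char> bytes = to_utf16le_bytes(out);
    const std::size_t written = std::fwrite(bytes.data(), 1, bytes.size(), f);
    const bool closed = (std::fclose(f) == 0);
    return closed && written == bytes.size();
}
```

Calling `append_log_line("log.txt", u"hi")` on a new file should produce the bytes FF FE 68 00 69 00 0D 00 0A 00. As noted in the reply above, no terminating 0 needs to be written to the file.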
  • I see nothing in your code snippet here which tells me how you opened the file.

    Did you use something like this?

    FILE * fp;
    fp = fopen("utest.dat", "w, ccs=UNICODE");

    Or this?

    fp = fopen("utest.dat", "w, ccs=UTF-16LE");

    Other?

    - Wayne
    Saturday, August 15, 2009 11:31 AM
  • bool init(){
        if(file)
            fclose(file);
        //try to open for reading
        file    = fopen("log.txt", "r");
        if(file){
            fclose(file);
            file    = fopen("log.txt", "ab");
        }else{
            char utf16[2] = {0xFF, 0xFE};
            file    = fopen("log.txt", "wb");
            fwrite(utf16, 1, 2, file);
        }
        inited    = true;
        return true;
    };

    I don't deal with text files a lot so I just went for a quick solution.
    -Greg
    Saturday, August 15, 2009 2:02 PM

  • Quote>I don't deal with text files a lot so I just went for a quick solution.

    Well, the quick solution is the wrong solution.

    You are confusing two concepts. Many of the C Runtime calls come in pairs (fopen and _wfopen, fprintf and fwprintf, etc.), depending on whether the parameters are char* or wchar_t*. This has nothing to do with the encoding of the file itself. For example, fwprintf() will write ANSI text to a file that has been opened with ANSI encoding (the default), even though its parameters are wide character strings. In your example, fopen("log.txt", "r"), the encoding is ANSI.

    For newly created files, you can specify the encoding. Wayne already pointed you in this direction. You can indicate which encoding you want by specifying either ccs=ANSI, ccs=UNICODE, ccs=UTF-8, or ccs=UTF_16LE in the fopen() call. Read more about it here. I find the documentation quite straightforward, but if you have questions, feel free to post them.
    Saturday, August 15, 2009 4:34 PM
  • Quote>if you have questions, feel free to post them.

    OK Brian, I'll call you on that. ;-)

    (1) The table in the help suggests that if UNICODE is used for a new file it will be written as ANSI.
    I'm not seeing that result. The file is in fact being created as UTF-16LE whether it preexisted or not.

    (2) You said:

    Quote>fwprintf() will print ANSI strings to a file that has been opened with ANSI encoding
    Quote>(the default), even though its parameters are specified as wide character strings.

    In my experience, if you try to fwprintf a wchar_t string to a file opened as ANSI (no ccs
    specified) you will get a runtime assert in debug mode or a fatal error in release mode.

    - Wayne

    P.S. - A small typo in your post: UTF_16LE should be UTF-16LE
    Saturday, August 15, 2009 6:32 PM
  • Follow-up: I can't seem to reproduce the symptoms I described in (2) above.
    In my latest tests it does indeed seem to behave as you described.
    Now I have to waste more grey cells trying to repro my earlier results,
    which were definite and consistent previously.

    - Wayne
    Saturday, August 15, 2009 7:07 PM


  • Quote>(1) The table in the help suggests that if UNICODE is used for a new file it will be written as ANSI.

    Wayne, I don't see that anywhere. Can you provide a quote I can search for?

    Quote>P.S. - A small typo in your post: UTF_16LE should be UTF-16LE

    Yup. Thanks.
    Saturday, August 15, 2009 7:26 PM
  • Follow-up to the Follow-up:

    The debug assertion may have occurred under the following conditions:

    file exists: no
    fopen: ccs=UNICODE
    fprintf (not fwprintf) of char (not wchar_t) string compiles clean but asserts in debug mode

    - Wayne

    Saturday, August 15, 2009 7:26 PM
  • Quote>I don't see that anywhere. Can you provide a quote I can search for?

    At the link you provided, scroll down to the table "Encodings Used Based on Flag and BOM".
    It says for Flag UNICODE and No BOM (or new file) encoding is ANSI.

    - Wayne

    Saturday, August 15, 2009 7:37 PM
  • Brian -

    One further question I meant to toss at you in my earlier post. You said:

    Quote>You can indicate which encoding you want by specifying either ccs=ANSI, ...

    I don't believe ccs=ANSI is acceptable. If that's what one wants, then they just omit the
    ccs argument altogether. YMMV

    - Wayne

    Saturday, August 15, 2009 9:30 PM
  • Quote>I don't see that anywhere. Can you provide a quote I can search for?

    At the link you provided, scroll down to the table "Encodings Used Based on Flag and BOM".
    It says for Flag UNICODE and No BOM (or new file) encoding is ANSI.

    - Wayne


    My interpretation of that clause is that it applies to existing files being opened, as opposed to a new file. Indeed, after reading the paragraph several times, I find the documentation confusing, if not outright wrong. However, when the flag is UNICODE and the file is new, a BOM is created and the encoding is UTF-16, which is what one would expect.
    Saturday, August 15, 2009 10:23 PM
  • Follow-up to the Follow-up:

    The debug assertion may have occurred under the following conditions:

    file exists: no
    fopen: ccs=UNICODE
    fprintf (not fwprintf) of char (not wchar_t) string compiles clean but asserts in debug mode

    - Wayne



    This is very interesting, and I've confirmed this. The same behaviour is seen in VS2010 Beta 1. This doesn't appear to be documented anywhere. I'd venture to say that this is a bona fide bug. I can't see why there would be this restriction. Why not submit this to Microsoft Connect and see what they say, Wayne?

    Nice catch!
     
    Quote>I don't believe ccs=ANSI is acceptable. If that's what one wants, then they just omit the
    Quote>ccs argument altogether. YMMV

    Yes, you are correct.
    Saturday, August 15, 2009 10:40 PM
  • Is this an MS standard? I haven't seen it in any other documentation.
    -Greg
    Sunday, August 16, 2009 7:35 AM