Possible bugs in WriteFile and CRT (Unicode issues)

  • Question

  • I stumbled upon some bugs/weird behavior when experimenting with Unicode output.

    First is that WriteFile seems to return the number of characters written, instead of the number of bytes. I'm not sure whether this is a bug with WriteFile, or with its documentation, or something else. Test program:

    #define WIN32_LEAN_AND_MEAN
    #include <windows.h>
    #include <stdio.h>
    #include <string.h>
    
    /* WriteFile returns number of characters written, instead of number of bytes.
       Occurs for example when writing to stdout with codepage 65001. */
    void test(void)
    {
    	HANDLE  out   = GetStdHandle(STD_OUTPUT_HANDLE);
    	char    str[] = { (char)0xC3, (char)0xA4, 0 }; // ä in UTF8
    	DWORD   len   = (DWORD)strlen(str);
    	DWORD   bytesWritten = 0; // WriteFile expects a DWORD*, so use DWORD directly
    
    	if (WriteFile(out, str, len, &bytesWritten, NULL))
    		printf("\nbytes requested: %lu, written: %lu\n",
    		       (unsigned long)len, (unsigned long)bytesWritten);
    	else
    		printf("WriteFile failed\n");
    }
    
    int main(void)
    {
    	UINT cp = GetConsoleOutputCP(); // remember the original codepage
    
    	SetConsoleOutputCP(850);
    	printf("Output codepage 850\n");
    	test();
    
    	SetConsoleOutputCP(65001);
    	printf("Output codepage 65001\n");
    	test();
    
    	SetConsoleOutputCP(cp); // restore the original codepage
    	return 0;
    }
    

    Compile and run:

    D:\>cl -nologo -MD WriteFile_bug.c
    WriteFile_bug.c
    
    D:\>WriteFile_bug
    Output codepage 850
    +ñ
    bytes requested: 2, written: 2
    Output codepage 65001
    ä
    bytes requested: 2, written: 1
    
    D:\>

    With codepage 65001 the two bytes are collapsed to the correct umlaut, and 1 is returned in lpNumberOfBytesWritten.
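    To make the mismatch concrete, here is a minimal, portable sketch that counts UTF-8 characters in a byte buffer. The function name utf8_char_count is made up for illustration, and the sketch assumes well-formed UTF-8 and that the console reports one unit per code point (true for ä; an assumption for anything outside the BMP). For the test string { 0xC3, 0xA4 } it yields 1, while strlen yields 2, which matches the two numbers printed above.

```c
#include <assert.h>
#include <stddef.h>

/* Count UTF-8 characters (code points) in buf[0..len).
   Every byte that is not a continuation byte (bit pattern 10xxxxxx)
   starts a new character. Assumes well-formed UTF-8; nothing is validated.
   Hypothetical helper, not part of any Windows or CRT API. */
size_t utf8_char_count(const unsigned char *buf, size_t len)
{
    size_t i, chars = 0;

    assert(buf != NULL || len == 0);
    for (i = 0; i < len; i++) {
        if ((buf[i] & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}
```

    Comparing utf8_char_count(str, len) with len is a quick way to tell whether a reported count is more likely characters than bytes.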

     

    This behavior triggers bugs in the CRT because some functions expect the input and output size to be equal, or else they flag the stream with an error. Test program:

    #include <stdio.h>
    #include <errno.h>
    #include <io.h>
    #include <fcntl.h>
    
    #define WIN32_LEAN_AND_MEAN
    #include <windows.h>
    
    /* Some crt functions (like flush) check that the number of bytes written is
       equal to the number of bytes set as input. Since WriteFile returns the number
       of *characters* written those two values can be unequal. The output is correct,
       but  ferror(stdout)  returns non-zero and breaks programs relying on it.
       This happens for example with codepage 65001. */
    void test(void)
    {
    	char str[] = { (char)0xC3, (char)0xA4, 0 }; // ä in UTF8
    	int  e1, e2, e3; // ferror results after each step
    
    	clearerr(stdout);
    	fputc(str[0], stdout);
    	e1 = ferror(stdout);
    
    	clearerr(stdout);
    	fputc(str[1], stdout);
    	e2 = ferror(stdout);
    
    	clearerr(stdout);
    	fflush(stdout);
    	e3 = ferror(stdout);
    
    	printf("\nferror on stdout: fputc: %d / %d, flush: %d\n", e1, e2, e3);
    }
    
    int main(void)
    {
    	UINT cp = GetConsoleOutputCP();
    	static char buf[4096]; // static: the buffer must outlive main, stdout is flushed at exit
    	setvbuf(stdout, buf, _IOFBF, 4096);
    
    	SetConsoleOutputCP(850);
    	printf("Output codepage 850\n");
    	test();
    
    	SetConsoleOutputCP(65001);
    	printf("Output codepage 65001\n");
    	test();
    
    	SetConsoleOutputCP(cp);
    	return 0;
    }
    

    Compile and run:

    D:\>cl -nologo -MD ferror_bug.c
    ferror_bug.c
    
    D:\>ferror_bug
    Output codepage 850
    +ñ
    ferror on stdout: fputc: 0 / 0, flush: 0
    Output codepage 65001
    ä
    ferror on stdout: fputc: 0 / 0, flush: 32
    
    D:\>

    With codepage 65001 WriteFile returns an unexpected number of bytes written and flush flags stdout with _IOERR.
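    The mechanism can be modeled without the CRT. The sketch below is a simplified, hypothetical model (model_stream, model_flush, MODEL_IOERR and the two writers are invented names, not the CRT's actual source or FILE layout): a flush that treats "reported count differs from requested count" as an error gets tripped by a writer that reports characters instead of bytes.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_IOERR 0x20  /* hypothetical error flag for this model */

/* Hypothetical stand-in for a buffered stream; not the CRT's FILE. */
typedef struct {
    unsigned char buf[64];
    size_t        used;   /* bytes currently buffered */
    int           flags;  /* MODEL_IOERR once a flush looks short */
} model_stream;

/* Low-level writer: returns the count it reports as written. */
typedef size_t (*model_writer)(const unsigned char *data, size_t len);

/* Writer that reports bytes, as documented. */
size_t report_bytes(const unsigned char *data, size_t len)
{
    (void)data;
    return len;
}

/* Writer that reports UTF-8 characters, like the behavior observed above. */
size_t report_chars(const unsigned char *data, size_t len)
{
    size_t i, chars = 0;

    for (i = 0; i < len; i++)
        if ((data[i] & 0xC0) != 0x80)  /* skip continuation bytes */
            chars++;
    return chars;
}

/* Flush: anything other than "all bytes written" is flagged as an error. */
int model_flush(model_stream *s, model_writer w)
{
    size_t requested, written;

    assert(s != NULL && w != NULL);
    requested = s->used;
    written   = w(s->buf, requested);
    s->used   = 0;
    if (written != requested) {
        s->flags |= MODEL_IOERR;
        return -1;
    }
    return 0;
}
```

    With the two bytes of ä buffered, model_flush succeeds with report_bytes and sets the error flag with report_chars, even though in both cases all the data went out.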

     

    Another thing I noticed is that fwrite is quite useless when _O_WTEXT mode is set. It can crash in several ways and makes writing quite cumbersome without resorting to WriteFile/WriteConsole:

    #include <stdlib.h>
    #include <stdio.h>
    #include <string.h>
    #include <io.h>
    #include <fcntl.h>
    
    /* fwrite can crash in several ways when the _O_WTEXT mode is set.
    
       _O_WTEXT only accepts even number of bytes when writing (e.g. wchar_t).
       So when size*count for fwrite is uneven it obviously crashes. But it can
       also happen with an even number of bytes:
    
         - when no buffer is attached,
         - or when a buffer is attached but is at an uneven offset.
    
       Calls to write eventually flush the buffer using _flsbuf (so does the first
       call to fwrite), which in turn writes the next byte to the stream with
       _write:
    
                           charcount = sizeof(TCHAR); // TCHAR = char for _flsbuf
           #ifndef _UNICODE
                           written = _write(fh, &ch, charcount); // charcount = 1
    
       But _write does not allow uneven sizes (1 byte in this case) with _O_WTEXT
       and crashes.
    
       Setting a buffer with setvbuf helps in most cases. If a big buffer
       (_IOFBF) is attached _flsbuf flushes the current buffer:
    
           charcount = (int)(stream->_ptr - stream->_base);
    
       But this difference *can* be uneven if a buffer is attached and filled
       with an uneven number of bytes before switching to _O_WTEXT. _IONBF uses a
       2-byte buffer, so _flsbuf is never called and fwrite writes directly using
       _write in multiples of 2.
     */
    void test()
    {
    	char str[] = { 0xC3, 0xA4, 0 }; // ä in UTF8
    	int  len   = strlen(str);
    	int  bytesWritten;
    
    	bytesWritten = fwrite(str, 1, len, stdout); // crashes
    	wprintf(L"bytes written: %d, ferror on stdout: %d\n", bytesWritten, ferror(stdout));
    }
    
    // crash without buffer attached, not even the 2-byte buffer used by _IONBF
    void test1() {
    	_setmode(_fileno(stdout), _O_WTEXT);
    	test();
    }
    
    // crash with big buffer at an uneven offset
    void test2() {
    	static char buf[4]; // static: stdout keeps using the buffer after test2 returns
    	setvbuf(stdout, buf, _IOFBF, 4);
    	fputc(' ', stdout);
    
    	_setmode(_fileno(stdout), _O_WTEXT);
    	test();
    }
    
    // no crash with big buffer at even offset
    void test3() {
    	static char buf[4]; // static: stdout keeps using the buffer after test3 returns
    	setvbuf(stdout, buf, _IOFBF, 4);
    
    	_setmode(_fileno(stdout), _O_WTEXT);
    	test();
    }
    
    // no crash with 2-byte buffer
    void test4() {
    	_setmode(_fileno(stdout), _O_WTEXT);
    	setvbuf(stdout, NULL, _IONBF, 0);
    	test();
    
    	_setmode(_fileno(stdout), _O_TEXT);
    	setvbuf(stdout, NULL, _IONBF, 0);
    	fputc(' ', stdout);
    	_setmode(_fileno(stdout), _O_WTEXT);
    	test();
    }
    
    int main(int argc, const char **argv)
    {
    	int t = argc > 1 ? atoi(argv[1]) : 4;
    
    	switch (t) {
    	case 1: test1(); break;
    	case 2: test2(); break;
    	case 3: test3(); break;
    	case 4: test4(); break;
    	}
    
    	return 0;
    }
    

    Compile and run it. The first two tests crash while the last two don't due to carefully set up buffers.

     

    I tested all those on an up-to-date Windows 7 Professional machine.

    MSVCR90.DLL: 9.0.30729.4926

    KERNEL32.DLL: 6.1.7600.16385

    Saturday, March 20, 2010 4:33 PM

Answers

All replies

  • > First is that WriteFile seems to return the number of characters
    > written, instead of the number of bytes. I'm not sure whether this is a
    > bug with WriteFile, or with its documentation, or something else. Test
    > program:
    >...
    > D:\>WriteFile_bug
    > Output codepage 850
    > +ñ
    > bytes requested: 2, written: 2
    > Output codepage 65001
    > ä
    > bytes requested: 2, written: 1

    Under Windows 7, I get:

    Output codepage 850
    ä
    bytes requested: 2, written: 2
    Output codepage 65001
    ä
    bytes requested: 2, written: 2

    > Another thing I noticed is that fwrite is quite useless when _O_WTEXT
    > mode is set. It can crash in several ways and makes writing quite
    > cumbersome without resorting to WriteFile/WriteConsole:

    In testing with VS2008 there does appear to be some odd validation in
    write.c. The comment is out of step with the UTF8 allowance to my mind:

    if(tmode == __IOINFO_TM_UTF16LE ||
    tmode == __IOINFO_TM_UTF8)
    {
    /* For a UTF-16 file, the count must always be an even
    number */
    _VALIDATE_CLEAR_OSSERR_RETURN(((cnt & 1) == 0), EINVAL, -1);
    }

    If I skip around that, it doesn't crash for me. How about you?

    Dave
    Saturday, March 20, 2010 6:40 PM
    > Output codepage 850
    > ä
    > bytes requested: 2, written: 2
    > Output codepage 65001
    > ä
    > bytes requested: 2, written: 2

    I tried it again, this time with different console settings. I get correct behavior with cmd.exe when using the old bitmap font (obviously, since it cannot display ä, and thus displays those two raw bytes). With Lucida Console or Consolas I get the other behavior: ä, with 1 as lpNumberOfBytesWritten.

    > In testing with VS2008 there does appear to be some odd validation in
    > write.c. The comment is out of step with the UTF8 allowance to my mind:
    >
    > if(tmode == __IOINFO_TM_UTF16LE ||
    > tmode == __IOINFO_TM_UTF8)
    > {
    > /* For a UTF-16 file, the count must always be an even
    > number */
    > _VALIDATE_CLEAR_OSSERR_RETURN(((cnt & 1) == 0), EINVAL, -1);
    > }
    >
    > If I skip around that, it doesn't crash for me. How about you?

    That's exactly the spot where it crashes (by enforcing writing an even number of bytes). Obviously it doesn't crash if the check is skipped. And I agree that this seems wrong for __IOINFO_TM_UTF8. But for __IOINFO_TM_UTF16LE (which is set by _O_WTEXT) this seems ok since _O_WTEXT expects to output wchar_t (same reason why printf and fputc crash early when used with _O_WTEXT). And the problem is that higher-level fwrite/_flsbuf don't honor this convention and try to flush a single byte sometimes.
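    Since _O_WTEXT wants whole UTF-16 code units, one way to stay within that convention is to convert UTF-8 input to UTF-16 first, so the byte count handed down is always even. On Windows this would normally be done with MultiByteToWideChar(CP_UTF8, ...); the sketch below is a hand-rolled, portable stand-in under stated assumptions: utf8_to_utf16 is a hypothetical name, uint16_t stands in for the 16-bit wchar_t, and overlong forms and surrogate code points are not rejected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode UTF-8 from src[0..len) into UTF-16 code units in dst[0..cap).
   Returns the number of units written, or (size_t)-1 if dst is too small,
   the input is truncated, or a lead/continuation byte is invalid.
   A sketch, not a full validator. */
size_t utf8_to_utf16(const unsigned char *src, size_t len,
                     uint16_t *dst, size_t cap)
{
    size_t i = 0, out = 0;

    assert(src != NULL && dst != NULL);
    while (i < len) {
        uint32_t      cp;
        size_t        extra, k;
        unsigned char b = src[i++];

        /* The lead byte determines how many continuation bytes follow. */
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return (size_t)-1;      /* stray continuation or invalid lead */

        if (len - i < extra)
            return (size_t)-1;       /* truncated sequence */
        for (k = 0; k < extra; k++) {
            unsigned char c = src[i++];
            if ((c & 0xC0) != 0x80)
                return (size_t)-1;
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < 0x10000) {
            if (out + 1 > cap)
                return (size_t)-1;
            dst[out++] = (uint16_t)cp;
        } else {
            if (cp > 0x10FFFF || out + 2 > cap)
                return (size_t)-1;
            cp -= 0x10000;           /* encode as a surrogate pair */
            dst[out++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[out++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    return out;
}
```

    The resulting units could then be written with fwrite(units, sizeof units[0], n, stdout); whether that avoids the _flsbuf single-byte flush in every buffer state is not something this sketch verifies.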

    Saturday, March 20, 2010 7:04 PM
  • > I tried it again, this time with different console settings. I get
    > correct behavior with cmd.exe when using the old bitmap font (obviously,
    > since it cannot display ä, and thus displays those two raw bytes). With
    > Lucida Console or Consolas I get the other behavior: ä, with 1 as
    > lpNumberOfBytesWritten.

    OK, confirmed.

    I'm not sure how you'll progress this further - you may have to ring MS
    support.

    Dave
    Saturday, March 20, 2010 11:25 PM
  • > I'm not sure how you'll progress this further - you may have to ring MS
    > support.

    Having said that, the issue with the CRT also exists with VS2010 RC, so
    you could try submitting a bug report on the Connect web site against VS
    and see what response you get there. If you do that, please post a link
    here to your bug report.

    Dave
    Saturday, March 20, 2010 11:40 PM
  • I submitted a bug report on Connect.
    • Marked as answer by Wesley Yao Friday, March 26, 2010 6:16 AM
    Monday, March 22, 2010 9:51 PM
  • BTW: by coincidence, I have just run into the same issue independently and was looking for answers.

    IMHO, it is a bug in the implementation of WriteFile() when writing to the console.

    In my case I have no workaround for that. I am receiving data in UTF-8. I don't know anything more about the content, and I just need to write it to a HANDLE, which may happen to be a console. The code, therefore, must check the number of bytes written and, if not everything was written yet, write the rest. To implement a workaround I would have to parse that UTF-8 stream and count the number of characters myself, and only if the handle is a console handle. That is ridiculous.
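    For reference, the workaround described above could look roughly like the following sketch. Everything in it is hypothetical: writer_fn abstracts the actual write call, fake_console_write only imitates the observed behavior (it accepts at most two characters per call and reports characters), and the code assumes well-formed UTF-8. Detecting whether a HANDLE is a console is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Byte length of the first nchars UTF-8 characters in buf[0..len). */
size_t utf8_bytes_for_chars(const unsigned char *buf, size_t len, size_t nchars)
{
    size_t i = 0;

    assert(buf != NULL || len == 0);
    while (i < len && nchars > 0) {
        i++;                                        /* lead byte */
        while (i < len && (buf[i] & 0xC0) == 0x80)
            i++;                                    /* continuation bytes */
        nchars--;
    }
    return i;
}

/* Hypothetical writer: returns 0 on failure, else stores the reported count. */
typedef int (*writer_fn)(const unsigned char *data, size_t len, size_t *reported);

/* Write everything, retrying after partial writes. If reports_chars is
   nonzero, the reported count is converted from characters to bytes. */
int write_all(writer_fn w, int reports_chars,
              const unsigned char *data, size_t len)
{
    assert(w != NULL && (data != NULL || len == 0));
    while (len > 0) {
        size_t reported, advance;

        if (!w(data, len, &reported))
            return 0;
        advance = reports_chars ? utf8_bytes_for_chars(data, len, reported)
                                : reported;
        if (advance > len)
            advance = len;
        if (advance == 0)
            return 0;          /* no progress; avoid looping forever */
        data += advance;
        len  -= advance;
    }
    return 1;
}

/* Fake console for demonstration: takes at most two characters per call,
   appends their bytes to sink, and reports the number of characters. */
static unsigned char sink[64];
static size_t        sink_len;

int fake_console_write(const unsigned char *data, size_t len, size_t *reported)
{
    size_t nbytes = utf8_bytes_for_chars(data, len, 2);
    size_t i, chars = 0;

    if (sink_len + nbytes > sizeof sink)
        return 0;
    for (i = 0; i < nbytes; i++) {
        if ((data[i] & 0xC0) != 0x80)
            chars++;
        sink[sink_len++] = data[i];
    }
    *reported = chars;
    return 1;
}
```

    Calling write_all(fake_console_write, 1, data, len) delivers each byte exactly once, whereas passing 0 for reports_chars treats the character count as a byte count and re-sends bytes that were already written.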

     

    Thank you very much for filing the bug report.

     

    However, there are more issues with the console and code page 65001. ReadFile() also has a problem. If I call ReadFile() on a handle to console input and paste Unicode text into the console window, ReadFile() returns in the middle of the string with an indication of EOF! That is, it returns precisely what MSDN says about EOF:

     

    "When a synchronous read operation reaches the end of a file, ReadFile returns TRUE and sets *lpNumberOfBytesRead to zero."

     

    However, there certainly was no EOF, since after my application exits the remaining characters are pasted into cmd.exe.

     

    Tuesday, March 23, 2010 8:03 AM