Unicode string handling routines in C runtime libraries - 0xFFFF characters

  • Question

  • I have an application that scans file systems and allows users to report on their contents.  I have one user whose file system contains objects - both directories and files - with names that include the Unicode code point 0xFFFF.  I find it surprising that the file system accepts these names but I have confirmed with a small program that there is no problem creating objects with 0xFFFF characters in their names. 

    The real subject of this question relates to using the wide character functions wcscpy(), wcscat(), swprintf(), etc. to manipulate these names once I have retrieved them from the file system with FindNextFile().  In all cases, it appears that the wide character functions treat the 0xFFFF character as a terminator (like the terminating null character) and truncate the operation. 

    Questions:

    Is this a bug in the string handling library?

    Is there a Visual Studio parameter or some other setting that would modify this behavior?

    Why does NTFS accept these characters in filenames?  Is this a file system bug?

    Friday, August 15, 2014 5:12 AM

Answers

  • "since you appear to have an inside track with the Redmond crew"

    Not really, the source code for the VC++ runtime is provided with the product. The detail about sprintf using a fake FILE object is also known from a blog post made by a member of the VC++ team: http://blogs.msdn.com/b/vcblog/archive/2014/06/10/the-great-crt-refactoring.aspx

    "Beginning with VS12, the VS debugger does not display Unicode strings with a cursor hover or in the Watch windows unless you follow their names (or their containing structure names) with ',su'."

    I don't have VS12 to test but the following C code works fine for me in VS2013:

    wchar_t *p = L"foo";
    wchar_t a[100];
    wcscpy_s(a, 100, p);
    

    Both a and p display correctly in the debugger.

    "Concerning the file system accepting \uFFFF, this does not seem logical since this code point is clearly defined in the Unicode standard as 'not a character'."

    Maybe, but it's too late to change that now. It's quite interesting that control characters in the range 1-31 are banned but \uFFFF (and probably others) are not.

    • Marked as answer by Shu 2017 Thursday, August 21, 2014 10:44 AM
    Friday, August 15, 2014 1:53 PM

All replies

  • Interesting, a related question was posted on this forum about a week ago: http://social.msdn.microsoft.com/Forums/en-US/bf08f060-0261-43ad-bade-c2797f4fb3a1/the-problem-about-cstring-0xffff?forum=vcgeneral

    I'm not aware of any problems related to 0xFFFF and wcscpy and wcscat, they should work fine. swprintf is the one causing problems. To add some more details to what I said in the other thread:

    VC++'s implementation of printf & co. relies on a common function which outputs the string to a FILE. swprintf-like functions create a fake FILE object which is not associated with an actual file and has its buffer set to the buffer specified in the swprintf call. The common function then uses putwc to write characters to that FILE object, and this is where the trouble starts: putwc returns the character that was put, or WEOF to indicate an error. WEOF happens to have the value 0xffff, so the callers of putwc have no way to distinguish between an error and the more or less valid character \uFFFF.

    I suppose this could be considered as a bug in the printf family of functions. It's basically a case of leaking implementation details because there's no requirement for printf to use putc/putwc. putwc/WEOF are also to blame because unlike putc they use an actual character code.
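    To make the collision concrete, here is a minimal, portable sketch. It is not the CRT source: unit_t, UNIT_EOF, sink_t and the four functions are hypothetical stand-ins for a 16-bit wchar_t, WEOF, the fake FILE, and putwc and its callers. It assumes int is wider than 16 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for wchar_t and WEOF as described above:
   a 16-bit code unit and an in-band error value of 0xFFFF. */
typedef uint16_t unit_t;
#define UNIT_EOF ((unit_t)0xFFFF)

/* A fixed-size output buffer, playing the role of the fake FILE. */
typedef struct {
    unit_t *buf;
    size_t  cap;
    size_t  len;
} sink_t;

/* putwc-style: returns the unit written, or UNIT_EOF when the sink is full. */
static unit_t sink_put(sink_t *s, unit_t c)
{
    assert(s != NULL);
    if (s->len >= s->cap)
        return UNIT_EOF;
    s->buf[s->len++] = c;
    return c;
}

/* Caller that trusts the in-band error value. Returns the number of
   units it believes were written successfully. */
static size_t write_in_band(sink_t *s, const unit_t *src)
{
    size_t n = 0;
    for (; *src != 0; ++src) {
        if (sink_put(s, *src) == UNIT_EOF)
            break;              /* also taken when *src is a genuine 0xFFFF */
        ++n;
    }
    return n;
}

/* putc-style alternative: the status is an int, so (with int wider than
   16 bits) no unit value can collide with the error value -1. */
static int sink_put_wide(sink_t *s, unit_t c)
{
    assert(s != NULL);
    if (s->len >= s->cap)
        return -1;
    s->buf[s->len++] = c;
    return (int)c;
}

/* Caller that checks the out-of-band status instead. */
static size_t write_out_of_band(sink_t *s, const unit_t *src)
{
    size_t n = 0;
    for (; *src != 0; ++src) {
        if (sink_put_wide(s, *src) < 0)
            break;
        ++n;
    }
    return n;
}
```

    With the input { 'a', 0xFFFF, 'b', 0 } and enough room in the sink, write_in_band stops after one unit while write_out_of_band writes all three.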

    This might have been fixed in VS14, currently in preview. I know that they rewrote much of the code involved in formatting and the FILE trick is no longer used.

    As for a filesystem bug, probably not. Filesystems tend to be conservative in banning characters: they ban certain characters such as \ < and > but not much more. NTFS is also known to allow NUL characters, and that's worse than \uFFFF.

    Friday, August 15, 2014 6:06 AM
  • It is also quite possible that they don't care too much about 0xffff because it is a reserved code point and not a valid character. As the Unicode standard says:

    Noncharacters
    These codes are intended for process-internal uses.
    FFFE <not a character>
    • may be used to detect byte order by contrast with FEFF
    → FEFF zero width no-break space
    FFFF <not a character>

    Since 0xffff isn't a valid character in the execution wide character set, that would mean it is implementation defined how it is treated.
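    An application that wants to flag such names in a report could test for noncharacters itself. A minimal sketch, assuming names arrive as zero-terminated UTF-16: Unicode defines 66 noncharacters, namely U+FDD0..U+FDEF and the last two code points of each plane. The function names here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True for the 66 Unicode noncharacters. */
static bool is_noncharacter(uint32_t cp)
{
    if (cp > 0x10FFFF)
        return false;                   /* outside the Unicode codespace */
    if (cp >= 0xFDD0 && cp <= 0xFDEF)
        return true;                    /* the contiguous block of 32 */
    return (cp & 0xFFFE) == 0xFFFE;     /* xxFFFE and xxFFFF in every plane */
}

/* Counts noncharacters in a zero-terminated UTF-16 string. */
static size_t count_noncharacters(const uint16_t *s)
{
    size_t n = 0;
    assert(s != NULL);
    while (*s != 0) {
        uint32_t cp = *s++;
        /* Combine a high surrogate with the low surrogate that follows it. */
        if (cp >= 0xD800 && cp <= 0xDBFF && *s >= 0xDC00 && *s <= 0xDFFF)
            cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t)(*s++ - 0xDC00);
        if (is_noncharacter(cp))
            ++n;
    }
    return n;
}
```

    Note that U+FEFF (zero width no-break space) is an ordinary character and is not counted; only U+FFFE and U+FFFF from the excerpt above are.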


    Friday, August 15, 2014 6:45 AM
  • Many thanks for the quick reply.  I agree that wcscpy, wcscat, etc. work correctly - my earlier observation was due to a silly pointer mistake. 

    Hopefully, the swprintf bug will be fixed in VS14 as you indicate.  This problem also exists in the StringCbPrintf() and StringCchPrintf() functions which, I assume, use the same underlying code.
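    Until then, one possible workaround is to build paths without the printf family at all, since the copy functions are not affected. A minimal sketch using only standard C (wcslen and wmemcpy); join_path is a hypothetical helper, not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Joins dir and name with a backslash into dst, whose capacity is cap
   wchar_t units. Returns 0 on success, -1 if the result would not fit. */
static int join_path(wchar_t *dst, size_t cap,
                     const wchar_t *dir, const wchar_t *name)
{
    size_t dlen, nlen;

    assert(dst != NULL && dir != NULL && name != NULL);
    dlen = wcslen(dir);
    nlen = wcslen(name);

    /* Room is needed for dir, one separator, name and the terminator. */
    if (dlen + 1 + nlen + 1 > cap)
        return -1;

    wmemcpy(dst, dir, dlen);
    dst[dlen] = L'\\';
    wmemcpy(dst + dlen + 1, name, nlen + 1);   /* copies the terminator too */
    return 0;
}
```

    Because nothing here inspects a return value that doubles as a character, a 0xFFFF unit in either argument is copied like any other.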

    While discussing Visual Studio, perhaps you can shed some light on another annoyance since you appear to have an inside track with the Redmond crew.  Beginning with VS12, the VS debugger does not display Unicode strings with a cursor hover or in the Watch windows unless you follow their names (or their containing structure names) with ',su'.  Ordinary ASCII strings display fine.  In a response to a previous post in this forum, I got the reply 'set the project parameter 'Compile as:' to C++ (/TP)'.  This does fix the problem but is also an annoyance if I do not want function name decoration or other C++ side effects.  The person who responded earlier also said 'perhaps fixed in a later rev'.  Can you comment?

    Concerning the file system accepting \uFFFF, this does not seem logical since this code point is clearly defined in the Unicode standard as 'not a character'.

    Friday, August 15, 2014 11:59 AM