Character Encoding Confusion

  • Question

  • I have a VB.Net application that works in conjunction with a C++ app. The C++ app writes the contents of a printer DEVMODE structure to a file as raw bytes. The VB.Net app reads the DEVMODE from that file into a VB.Net structure using a BinaryReader. The DEVMODE contains two character-array fields: the device name and the form name. I expect (and inspecting the file seems to confirm) that these character fields are written out as 2-byte characters, i.e. Unicode/UTF-16. To read these fields in VB.Net, I use the ReadChars function of the BinaryReader I created. I would expect to create the BinaryReader with the Unicode encoding (which is UTF-16) and call ReadChars(32), since the fields are a fixed length of 32 characters.

    If I do this, the Char() returned by ReadChars is not correct. To get the correct Char() back, I have to set the BinaryReader encoding to UTF-8 and call ReadChars(64). That makes no sense to me from the documentation. The DEVMODE character fields are defined as TCHAR/WCHAR, which should be Unicode/UTF-16, so I should be able to call ReadChars with a length of 32 and have it return the correct char array, advancing 2 bytes per character. I can't figure out how UTF-8 works with a UTF-16 field, and given that I used UTF-8 as the encoding, why would ReadChars need to be told to read 64 characters when there are only 32 characters (written as 2-byte chars, 64 bytes total)?

    I am just not getting what is going on here.
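    For concreteness, here is a minimal Python sketch (with a hypothetical device name, not the actual VB.Net code) of the byte-level behavior described above: a fixed 32-WCHAR field is 64 bytes on disk, and decoding those same bytes as UTF-8 instead of UTF-16 yields 64 one-byte "characters" with a NUL interleaved after each letter.

```python
# Hypothetical device name; the real DEVMODE fields are 32 TCHARs.
name = "HP LaserJet 4"

# What the C++ app writes: a fixed 32-WCHAR field, UTF-16LE, NUL-padded.
field = name.ljust(32, "\x00").encode("utf-16-le")
assert len(field) == 64  # 32 characters * 2 bytes each

# Decoded as UTF-16LE: 32 chars, the name plus NUL padding.
utf16_chars = field.decode("utf-16-le")
assert len(utf16_chars) == 32

# Decoded as UTF-8: each byte stands alone (ASCII bytes decode as
# themselves, 0x00 as NUL), so the same 64 bytes come back as 64
# "characters" with a NUL after every letter.
utf8_chars = field.decode("utf-8")
assert len(utf8_chars) == 64
assert utf8_chars[:4] == "H\x00P\x00"
```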

    Wednesday, April 16, 2014 7:48 PM

Answers

  • My mistake altogether. I found the problem in my code. Much earlier I made an incorrect assumption about how all this works, and the code that converts the Char() returned by ReadChars to a string was incorrect. So I had a mismatch in interpretations of what was in the Char(). I kept thinking I was doing it wrong but could not quite put it together. Thanks for being a sounding board. That's what it takes sometimes :-)
    Wednesday, April 16, 2014 9:41 PM

All replies

  • .Net library strings and chars are always two bytes wide (UTF-16 code units). The encoding you give a reader controls how the bytes in the stream are decoded into those chars.

    Now if you set the binary reader to Unicode encoding, it will read two bytes at a time and decode them into one char. When you read 32 characters it should read 64 bytes, since characters in the .Net library are two bytes wide.

    Now if you are getting the wrong results, the byte order may be reversed and you may have to switch between big-endian and little-endian encodings.
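    A quick way to see the byte-order effect (a Python sketch with a hypothetical name; .Net's Encoding.Unicode is UTF-16 little-endian and Encoding.BigEndianUnicode is the byte-swapped variant):

```python
# Hypothetical fixed-width WCHAR field as a Windows app would write it:
# UTF-16 little-endian, NUL-padded to 32 characters.
field = "MyPrinter".ljust(32, "\x00").encode("utf-16-le")

# Matching byte order recovers the text.
ok = field.decode("utf-16-le").rstrip("\x00")
assert ok == "MyPrinter"

# Mismatched byte order swaps each character's bytes:
# 'M' (U+004D) comes back as U+4D00, i.e. garbage.
swapped = field.decode("utf-16-be")
assert swapped[0] == "\u4d00"
```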


    jdweng

    Wednesday, April 16, 2014 8:27 PM
  • Well, that's what I thought, but it does not work. Windows is big-endian, so for most chars the code is in the first byte and the second byte is a null. So I open the BinaryReader with Encoding.BigEndianUnicode or Encoding.Unicode (UTF-16) and then call ReadChars(32) into MyCharArray(). I then trace the value by writing the result to a trace file as a string: New String(MyCharArray). The string is total garbage if I use Encoding.BigEndianUnicode. If I use Encoding.Unicode, the string is incorrect but some characters from the source are recognizable. What's also interesting is that if I set the reader encoding to UTF-8, it acts as if I am just reading bytes (length = 64), and it works just as if I were reading bytes into a Char().

    I could just read bytes, as others have suggested, and convert them to a Char(), but it seems to me that the point of specifying the encoding on BinaryReader and using ReadChars is to have .Net intelligently process the byte stream instead of me having to do it.
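    The manual route mentioned above can be sketched like this (a Python stand-in for reading the raw bytes and decoding them explicitly; the name is hypothetical):

```python
import io

# Stand-in for the file the C++ app wrote: a 32-WCHAR field, UTF-16LE.
stream = io.BytesIO("MyPrinter".ljust(32, "\x00").encode("utf-16-le"))

# Read the raw 64 bytes of the field, then decode and trim the NUL padding.
raw = stream.read(64)
name = raw.decode("utf-16-le").rstrip("\x00")
assert name == "MyPrinter"
```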

    Wednesday, April 16, 2014 9:14 PM
  • My mistake. Windows is little-endian, which explains why switching my code to big-endian produced worse results. So, back to the original question: if I write out a string of 32 characters from a buffer of C++ TCHAR/WCHAR, I should be able to read them into a Char() in VB.Net by using a BinaryReader with Encoding.Unicode and ReadChars(32), just as you suggested in your reply.
    Wednesday, April 16, 2014 9:24 PM