How to decode UTF 16 Big Endian content

  • Question

  • Hi,

    From a third-party provider I receive some text in a query string (HTTP GET). I need to store this text in the database and then display it on the UI. The text contains Japanese characters, and I was told that the encoding used is UTF-16 big endian. How should I retrieve the content from the query string and store it in the database? When I use Request.QueryString and store the value in the database, all the content is stored as strange characters: '0o0D0e�g,��0g��O�0W0~0Y0'.

    Can someone guide me in this please?

    Thanks

    Thursday, October 27, 2011 7:58 AM

Answers

  • I think you are fishing for lines of code without actually understanding what is going on.  You need to start with what you have and then work toward what you want.

    First, you are getting a "string" from a query string from an HTTP GET.  An HTTP GET request is a transmission of bytes from a client to a server.  We typically represent these bytes as characters so that we can type them in (and work with them easily in our programs), but in reality, it's just bytes.

    So the first step you need to perform is to translate what your .NET API gives you (a string object) into its raw form (its bytes).

    You say that your query string is coming in as "'0o0D0e�g,��0g��O�0W0~0Y0", but that's just the way one particular character encoding interprets the actual bytes transmitted in an HTTP GET.  You were told that the actual characters you want were encoded into bytes using UTF16BE.  The reason that you are seeing strange characters is that the software in which you are "seeing" these characters is interpreting these bytes as something other than UTF16BE.

    So, first, you need to figure out which encoding the .NET base class library has used to transform the raw bytes from the HTTP GET request into the .NET strings you can work with as query string parameters.  I am not 100% sure about this, but it's probably UTF-8. 

    This means that the first thing you need to do is reverse .NET's automatic decoding of the HTTP GET request bytes as UTF-8:

    string queryStringParameter = ....
    byte[] binary = Encoding.UTF8.GetBytes(queryStringParameter);
    

    Now that you have the binary, if you want to transform it back into the original characters that were transmitted in UTF16BE, you need to decode the binary as UTF16BE, like this:

    string original = Encoding.BigEndianUnicode.GetString(binary);
    
    This is essentially what Louis was trying to say, and it should give you the correct results.
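
    The two steps above can be sketched as one helper. This is a hedged sketch, not a guaranteed fix: QueryStringRepair and Redecode are hypothetical names, and the encoding the framework used to build the string is an assumption supplied by the caller. One caveat worth knowing: if that encoding was UTF-8, any byte sequence that was not valid UTF-8 has already been replaced by U+FFFD (the � in the value above), and those bytes cannot be recovered by re-encoding. A single-byte encoding such as ISO-8859-1 maps every byte to exactly one character, so it does round-trip.

    ```csharp
    using System;
    using System.Text;

    // Hypothetical helper, for illustration only.
    static class QueryStringRepair
    {
        // Turns a mis-decoded string back into bytes using the encoding that is
        // ASSUMED to have produced it, then decodes those bytes as UTF-16 big endian.
        public static string Redecode(string misdecoded, Encoding assumedRequestEncoding)
        {
            if (misdecoded == null) throw new ArgumentNullException("misdecoded");
            if (assumedRequestEncoding == null) throw new ArgumentNullException("assumedRequestEncoding");

            // Step 1: reverse the framework's decoding to get the raw bytes back.
            byte[] raw = assumedRequestEncoding.GetBytes(misdecoded);

            // Step 2: interpret the raw bytes as UTF-16BE.
            return Encoding.BigEndianUnicode.GetString(raw);
        }
    }
    ```

    For the ASCII-looking start of the value, QueryStringRepair.Redecode("0o0D", Encoding.UTF8) yields "はい", because the bytes 30 6F 30 44 read as big-endian pairs are U+306F and U+3044.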
    Tuesday, November 1, 2011 8:38 PM

All replies

  • You can use Encoding.BigEndianUnicode to get a string from a UTF-16 big endian byte array:

    string mystring = Encoding.BigEndianUnicode.GetString(mybytes);

    Thursday, October 27, 2011 10:03 AM
  • Hi Louis,

    Thanks for your reply. Can you please elaborate on how this can be done for a query string? As I said, I need to get the content from the query string. It would be really helpful if you could post full code that does this job.

    Thanks

    Thursday, October 27, 2011 10:11 AM
  • First, you need to get the originally sent bytes, then decode them into a string:

    static string CorrectBadQueryString(string badString)
    {
        return Encoding.BigEndianUnicode.GetString(Encoding.Unicode.GetBytes(badString));
    }
    

     

    Thursday, October 27, 2011 2:44 PM
  • Hi Louis,

    Just to have the complete code: since I am getting the value from the query string, the code should be as below.

    string goodString = Encoding.BigEndianUnicode.GetString(Encoding.Unicode.GetBytes(Request.QueryString["content"]));

    Here Request.QueryString["content"] will have the bad string value, in this case '0o0D0e�g,��0g��O�0W0~0Y0'. Am I correct? Or do I need to do anything else?

    Thanks much

    Thursday, October 27, 2011 4:21 PM
  • You need to re-encode the text that contains Japanese characters before you use Request.QueryString().

    Tuesday, November 1, 2011 5:56 AM
  • Hi,

    Can you post some sample code for doing this?

    Many thanks

    Tuesday, November 1, 2011 8:11 AM
  • Before passing the parameter, you need to use Server.UrlEncode(japanese characters).ToString; then you can use Request.QueryString().

    Tuesday, November 1, 2011 1:20 PM
  • Well, the query string is passed by a third-party provider, and we have no control over how it is passed. We need to accept the value that has been sent to us and convert it properly.

    Thanks

    Tuesday, November 1, 2011 1:26 PM
  • Try this: System.Web.HttpUtility.UrlDecode(Request.QueryString["content"], System.Text.UnicodeEncoding.GetEncoding("UTF-16"))
    Tuesday, November 1, 2011 2:12 PM
  • So, the complete code will be something like the following:

                string unicodeContent = Encoding.Unicode.GetString(Encoding.BigEndianUnicode.GetBytes(System.Web.HttpUtility.UrlDecode("0o0D0e�g,��0g��O�0W0~0Y0", System.Text.UnicodeEncoding.BigEndianUnicode)));

    Is this correct?

    But the output is still '"漰䐰İ﷿﷿Ⱨ﷿﷿朰﷿﷿﷿﷿地縰夰Ȱ"', which I believe is incorrect.

    Thanks

    Tuesday, November 1, 2011 2:27 PM
  • Don't use Encoding; just try it like this:
    string unicodeContent = System.Web.HttpUtility.UrlDecode(Request.QueryString["content"], System.Text.UnicodeEncoding.GetEncoding("UTF-16"));

    • Edited by nuovoxx Tuesday, November 1, 2011 3:00 PM
    Tuesday, November 1, 2011 2:59 PM
  • Tried this

    string unicodeContent = System.Web.HttpUtility.UrlDecode("0o0D0e�g,��0g��O�0W0~0Y0", System.Text.UnicodeEncoding.GetEncoding("UTF-16"));

    But the output still doesn't seem to be proper - "漰䐰İ��Ⱨ��朰����地縰夰Ȱ"

    Tuesday, November 1, 2011 3:01 PM
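
    The leading characters of that output, 漰䐰, look like the byte-swapped form of the first four bytes of the received value ("0o0D" is 0x30 0x6F 0x30 0x44). That suggests the name "UTF-16" selected the little-endian decoder, which is what Encoding.Unicode is in .NET. A minimal sketch of the difference, assuming only those four bytes; ByteOrderDemo and its members are illustrative names:

    ```csharp
    using System;
    using System.Text;

    // Illustrative only: decodes the same four bytes with both byte orders.
    static class ByteOrderDemo
    {
        // The bytes behind the visible prefix "0o0D" of the received value.
        public static readonly byte[] Prefix = { 0x30, 0x6F, 0x30, 0x44 };

        // Little endian: each pair is read low byte first, giving U+6F30 and U+4430.
        public static string AsLittleEndian()
        {
            return Encoding.Unicode.GetString(Prefix);
        }

        // Big endian: each pair is read high byte first, giving U+306F and U+3044.
        public static string AsBigEndian()
        {
            return Encoding.BigEndianUnicode.GetString(Prefix);
        }
    }
    ```

    Under this assumption, the big-endian reading starts with the hiragana "はい", which is plausible for Japanese text, while the little-endian reading does not.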
  • Thanks Evan, that clearly explains what I am looking for, much appreciated.

    In Louis's code, the GetBytes method is called on the Unicode encoding; as per your explanation, I need to use the GetBytes of UTF8. I will implement this and let you know the result.

    Many thanks

     

    Wednesday, November 2, 2011 3:53 AM