Unicode encoding endianness or something

Answered Unicode encoding endianness or something

  • Saturday, March 02, 2013 12:59 PM
     
     

    There is a script which make me think that I don't quiet understand how Unicode encoding works in ms sql:

    SELECT CAST( N'z' as binary(2) );
    SELECT CAST( 0x7A00 as nchar(10) );
    SELECT CAST(UNICODE('z') as binary(2));
    SELECT CAST( 0x007A as nchar(10) );

    Why the results of 1 and 3 selects are different? Unicode tables says the code of 'z' is 0x007A, why I get 0x7A00 too?

All Replies

  • Saturday, March 02, 2013 3:20 PM
    Moderator
     
      Has Code

    Hi Max,

    It is a puzzle. If I have to make a guess, the first cast is wrong: it flips the high & low bytes:

    SELECT CAST( N'z' as binary(2) );				-- 0x7A00
    SELECT CAST(UNICODE(N'稀') as binary(2));		-- 0x7A00
    SELECT CAST(UNICODE(N'z') as binary(2));		-- 0x007A

    One thing though, in the first cast, you are not casting to hex, you are only casting to "binary" so anything goes as to say.


    Kalman Toth Database & OLAP Architect
    Paperback / Kindle: Pass SQL Exam 70-461 & Job Interview: Programming SQL Server 2012



  • Saturday, March 02, 2013 4:03 PM
     
      Has Code

    Hi Max,

    It is a puzzle. If I have to make a guess, the first cast is wrong: it flips the high & low bytes:

    SELECT CAST( N'z' as binary(2) );				-- 0x7A00
    SELECT CAST(UNICODE(N'稀') as binary(2));		-- 0x7A00
    SELECT CAST(UNICODE(N'z') as binary(2));		-- 0x007A

    One thing though, in the first cast, you are not casting to hex, you are only casting to "binary" so anything goes as to say.


    Kalman Toth Database & OLAP Architect
    Paperback / Kindle: Pass SQL Exam 70-461 & Job Interview: Programming SQL Server 2012



    Hi

    Yeah. It seems like MSSQL switch byte order(endianness) of characters when it works with them http://en.wikipedia.org/wiki/Byte_order_mark but why, it is suppose to be so or I understand it wrong?




  • Saturday, March 02, 2013 5:33 PM
     
     Answered Has Code

    You are seeing a combination of the effects of little endian and the way the convert and cast functions work.

    Yes, Windows is little endian.  So, the bytes are reversed when they are stored.  So N'z' which is a value of 0x007A will be stored as 7A00 in the actual data page on disk.  Also, if you have a integer column with the value of 122 (or 7 times 16 + 10), that is also a value of 0x007A, so it is also stored as 0x7A00.  You can verify this if you examine the data pages on disk (for a description of how to do this, SQL Server 2008 Internals by Kalen Delaney, et al does a great job or a good online source is Paul Randal's blog entry at http://blogs.msdn.com/b/sqlserverstorageengine/archive/2006/12/13/more-undocumented-fun_3a00_-dbcc-ind_2c00_-dbcc-page_2c00_-and-off_2d00_row-columns.aspx.

    So, if you do

    create table foo(i int, v nchar(1));
    insert foo(i, v) Values(122, N'z');

    then both i and v will have 0x7A00 stored on the disk.  But if you then do

    select i, v, cast(i as binary(2)), cast(v as binary(2)) from foo;

    the result will be

    122         z    0x007A 0x7A00

    As you can see, when using cast to convert an integer to or from binary, the byte flip is undone, but if you use cast to convert an nchar to or from binary, it does NOT un-flip the bytes.  Why Microsoft did that, I don't know, but that's the way it works.

    So when you convert N'z' by casting it as binary, you get 0x7A00.  But when you use the UNICODE function, the definition of that function is that it returns the "integer value as defined by the Unicode standard".  So UNICODE(N'z') returns an integer value of 122 - which will be stored as 0x7A00 because Windows is little endian, but when you cast it to binary, since you are now casting an integer, not a nchar, the bytes are un-flipped and you get 0x007A.  But if you now take the value 0x007A and cast it as nchar, the bytes are not flipped, So you end up with an nchar containing 0x007A and when SQL evaluates that it considers it to be a little endian value, so it is 7*16 cubed + 10*16 squared + 0*16 + 0 or 31,232.  And the unicode character corresponding to 31,232 is 稀.

    The only thing Microsoft guarantees is that round trips in conversion will always return the same value, that is, for example, if you cast some datatype to binary and then cast that binary value back to the same type, the final value will be the same as the original value.  So your first line casts a nchar to binary, and then takes that binary value and casts it to nchar.  So you get back your original value (N'z').

    But the third line of code uses UNICODE to change the N'z' to an int.  Then you convert that int to binary, then you convert that binary to an nchar.  And you are not guaranteed to get the same value and, as you have observed, you don't.  If you wanted to make a round trip out of it, you need to convert the binary back to an int and then use NCHAR() to convert the int to the corresponding unicode character.  So you would do

    SELECT CAST( N'z' as binary(2) );
    SELECT CAST( 0x7A00 as nchar(1) );
    SELECT CAST(UNICODE('z') as binary(2));
    SELECT NCHAR( CAST(0x007A as int) );

    Now both of those are true round trips and they both return the original value N'z'.

    Tom

    • Marked As Answer by Max Nevermind Saturday, March 02, 2013 7:29 PM
    •  
  • Saturday, March 02, 2013 7:28 PM
     
      Has Code

    You are seeing a combination of the effects of little endian and the way the convert and cast functions work.

    Yes, Windows is little endian.  So, the bytes are reversed when they are stored.  So N'z' which is a value of 0x007A will be stored as 7A00 in the actual data page on disk.  Also, if you have a integer column with the value of 122 (or 7 times 16 + 10), that is also a value of 0x007A, so it is also stored as 0x7A00.  You can verify this if you examine the data pages on disk (for a description of how to do this, SQL Server 2008 Internals by Kalen Delaney, et al does a great job or a good online source is Paul Randal's blog entry at http://blogs.msdn.com/b/sqlserverstorageengine/archive/2006/12/13/more-undocumented-fun_3a00_-dbcc-ind_2c00_-dbcc-page_2c00_-and-off_2d00_row-columns.aspx.

    So, if you do

    create table foo(i int, v nchar(1));
    insert foo(i, v) Values(122, N'z');

    then both i and v will have 0x7A00 stored on the disk.  But if you then do

    select i, v, cast(i as binary(2)), cast(v as binary(2)) from foo;

    the result will be

    122         z    0x007A 0x7A00

    As you can see, when using cast to convert an integer to or from binary, the byte flip is undone, but if you use cast to convert an nchar to or from binary, it does NOT un-flip the bytes.  Why Microsoft did that, I don't know, but that's the way it works.

    So when you convert N'z' by casting it as binary, you get 0x7A00.  But when you use the UNICODE function, the definition of that function is that it returns the "integer value as defined by the Unicode standard".  So UNICODE(N'z') returns an integer value of 122 - which will be stored as 0x7A00 because Windows is little endian, but when you cast it to binary, since you are now casting an integer, not a nchar, the bytes are un-flipped and you get 0x007A.  But if you now take the value 0x007A and cast it as nchar, the bytes are not flipped, So you end up with an nchar containing 0x007A and when SQL evaluates that it considers it to be a little endian value, so it is 7*16 cubed + 10*16 squared + 0*16 + 0 or 31,232.  And the unicode character corresponding to 31,232 is 稀.

    The only thing Microsoft guarantees is that round trips in conversion will always return the same value, that is, for example, if you cast some datatype to binary and then cast that binary value back to the same type, the final value will be the same as the original value.  So your first line casts a nchar to binary, and then takes that binary value and casts it to nchar.  So you get back your original value (N'z').

    But the third line of code uses UNICODE to change the N'z' to an int.  Then you convert that int to binary, then you convert that binary to an nchar.  And you are not guaranteed to get the same value and, as you have observed, you don't.  If you wanted to make a round trip out of it, you need to convert the binary back to an int and then use NCHAR() to convert the int to the corresponding unicode character.  So you would do

    SELECT CAST( N'z' as binary(2) );
    SELECT CAST( 0x7A00 as nchar(1) );
    SELECT CAST(UNICODE('z') as binary(2));
    SELECT NCHAR( CAST(0x007A as int) );

    Now both of those are true round trips and they both return the original value N'z'.

    Tom

    Thanks for the answer.

    Funny, I've just started learning MS SQL and decided to think up some examples for the cast function wich ended in how I can read data pages from a disk.