[MS-TDS] Section 2.2: What does "Unicode" mean in this context?

  • Question

  • The TDS specification says, in section 2.2:

    "All character data within a TDS message is in UNICODE."

    What does this mean?

    I understand one aspect of what it is meant to mean: that the code points used for characters are those from the Unicode standard.

    I assume it is also trying to say something about the representation of these codes on the wire. The strings "UTF" and "UCS" do not seem to appear in the spec, so it is difficult to tell what the actual encoding is.

    Is the actual format:
     * UCS-2 (in which case only a subset of Unicode can be used, but it is easy to work out byte lengths from character counts)
     * UTF-16
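
    To make the difference between the two candidates concrete, here is a minimal Python sketch. It is not taken from the TDS specification. The helper name `encode_ucs2_le` is hypothetical, and the little-endian byte order is an assumption at this point in the thread. For text that stays inside the BMP the two encodings produce identical bytes; they differ only in whether characters above U+FFFF are allowed.

```python
def encode_ucs2_le(text: str) -> bytes:
    """Hypothetical helper: encode text as UCS-2 (little-endian).

    UCS-2 uses exactly one 16-bit unit per character, so any code point
    above U+FFFF cannot be represented and is rejected here.
    """
    for ch in text:
        if ord(ch) > 0xFFFF:
            raise ValueError(f"U+{ord(ch):04X} is outside the BMP; no UCS-2 form")
    # For BMP-only text, UCS-2 and UTF-16 produce the same bytes.
    return text.encode("utf-16-le")


def encode_utf16_le(text: str) -> bytes:
    """Encode text as UTF-16LE using Python's built-in codec."""
    return text.encode("utf-16-le")


bmp_text = "abc"
# UCS-2: byte length is always 2 * character count.
assert len(encode_ucs2_le(bmp_text)) == 2 * len(bmp_text)

# UTF-16: one character outside the BMP takes 4 bytes, not 2.
assert len(encode_utf16_le("\U0001F600")) == 4
```

    With UCS-2 the byte length follows directly from the character count; with UTF-16 it does not.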


    Thursday, June 26, 2008 1:35 PM

All replies

  •  Paul - thank you for your inquiry (and sorry about the boilerplate text, I may not be the team member that takes ownership). One of our Protocol Support Team members will be in touch with you soon concerning this.

    Regards,
    Bill Wesse, MCSE / Escalation Engineer, US-CSS DSC PROTOCOL TEAM


    Escalation Engineer
    Thursday, June 26, 2008 7:25 PM
  • Paul - I have the answer for you, below. Please let me know if this meets your needs. I will be happy to elaborate further if you wish!

    Regards, Bill Wesse

    Question:
    =========
    [MS-TDS] Section 2.2: What does "Unicode" mean in this context?

    Reference:
    ==========
    2.2 Message Syntax
    Character data, such as T-SQL statements, within a TDS message is in Unicode, unless the character data represents the data value of an ASCII data type, such as a non-Unicode data column. Character counts within TDS are a count of characters, rather than bytes, except when explicitly specified as byte counts.


    Answer:
    =======
    This generally refers to UTF-15 Little-Endian format.

    Reference: [MS-GLOS].pdf

    Unicode:
    ...
    In this specification, all references to Unicode refer to a single Unicode character or an array of Unicode characters using the 16-bit UTF-16 form of the encoding. In this specification, when arrays of Unicode characters are defined, details are included that indicate if the array of Unicode characters is null-terminated.
    ...

    Download locations
    ==================

    The PDF documents (including [MS-GLOS].pdf) are available in both the 'Windows Communication Protocols Downloads' and 'Windows Server Protocols Downloads' sections near the bottom of the following page:

    Windows Open Protocols
    http://msdn.microsoft.com/en-us/windowsvista/cc297276.aspx

    Links on this page:
    Windows Communication Protocols (MCPP) Technical Documentation (.zip file)
    Windows Server Protocols (WSPP) Technical Documentation (.zip file)


    Escalation Engineer
    Friday, June 27, 2008 9:58 AM
  • Bill

    Thanks for your response.

    I am no clearer now:

    1) "UTF-15 Little-Endian format" - what is this? Not defined in Unicode, not in the Microsoft glossary. Is it a mistake for UTF-16?

    2) As for the glossary reference: the following text "A default 16-bit, fixed-width form called UTF-16" in the glossary is wrong: UTF-16 is *not* a fixed-width form. In UTF-16, single Unicode code points (characters) are represented using either 16 or 32 bits of data.

    My question remains: does the TDS protocol use:
     * UCS-2
     * UTF-16
     * some other encoding
    to represent Unicode code points (characters) on the wire?

    This information is needed to make sense of "Character counts within TDS are a count of characters...". If all characters are certain to be 16 bits (as in UCS-2), that's one thing. If characters may be 16 bits or 32 bits (as in UTF-16), that's another. In the first case, you can multiply the number of characters by 2 to get the number of bytes. In the second, you have to examine each 16-bit value to see whether it describes a single character or half a character.

    The other possibility is that when the TDS protocol talks about a count of "characters", it really means "16-bit units". If this is true, then it should say that. I can easily work with that.
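
    A minimal Python sketch of the distinction, assuming a well-formed little-endian buffer (the function name `count_utf16le` is made up for illustration and is not part of TDS): it walks the 16-bit units and reports both the number of units and the number of characters, which differ as soon as a surrogate pair appears.

```python
import struct
from typing import Tuple


def count_utf16le(data: bytes) -> Tuple[int, int]:
    """Return (code_units, characters) for a UTF-16LE byte buffer.

    A high surrogate (0xD800-0xDBFF) followed by a low surrogate
    (0xDC00-0xDFFF) is counted as one character made of two units.
    """
    if len(data) % 2 != 0:
        raise ValueError("UTF-16 data must have an even number of bytes")
    units = struct.unpack(f"<{len(data) // 2}H", data)
    characters = 0
    i = 0
    while i < len(units):
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        i += 2 if (is_high and has_low) else 1
        characters += 1
    return len(units), characters


# "a", U+1F600 (outside the BMP), "b": 3 characters but 4 code units.
units, chars = count_utf16le("a\U0001F600b".encode("utf-16-le"))
assert (units, chars) == (4, 3)
```

    If a TDS "character count" really is a count of 16-bit units, the byte length is simply twice the first number and no scanning is needed.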

    By the way, the reference to the glossary is not really a satisfactory way to define which of the encoding forms is used in a protocol. An encoding should be explicitly specified for TDS, such as "UTF-16LE".

    As well as this information from Unicode:
    http://www.unicode.org/faq/utf_bom.html#39

    the Wikipedia page may be of use:
    http://en.wikipedia.org/wiki/UTF-16


    Friday, June 27, 2008 1:37 PM
  • The encoding is indeed UTF-16LE. Sorry about the typo. And I agree: it is 16 bits only within the BMP; characters outside the BMP are encoded as 32-bit surrogate pairs.
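
    For reference, the surrogate-pair arithmetic defined by the Unicode standard can be sketched in Python as follows. The function name `to_surrogate_pair` is illustrative only, and nothing here is specific to TDS.

```python
import struct
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF use surrogate pairs")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
assert (high, low) == (0xD83D, 0xDE00)

# In little-endian order each 16-bit unit is written low byte first,
# which matches Python's built-in UTF-16LE codec.
assert struct.pack("<2H", high, low) == "\U0001F600".encode("utf-16-le")
```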

    I will research this further and get back to you.

    Thanks for your patience!



    Escalation Engineer
    • Marked as answer by KeithHa Wednesday, August 13, 2008 11:16 PM
    Tuesday, July 1, 2008 9:29 AM
  • Good morning Paul - I am currently (still) waiting for a development response to your question about character encoding in MS-TDS. I suspect the protocol itself is transparent with respect to UCS-2 and UTF-16; I cannot confirm that, however.

    Thanks for your patience - also, I have added several references concerning character encoding support in various versions of SQL Server.

    ===============================================
    Description of storing UTF-8 data in SQL Server
    http://support.microsoft.com/kb/232580

    SQL Server 7.0 and SQL Server 2000 use a different Unicode encoding (UCS-2) and do not recognize UTF-8 as valid character data.

    ===============================================
    XML Support in Microsoft SQL Server 2005
    http://msdn.microsoft.com/en-us/library/ms345117.aspx

    SQL Server 2005 stores XML data as Unicode (UTF-16). XML data retrieved from the server comes out in UTF-16 encoding as well.

    Regards,
    Bill Wesse


    Escalation Engineer
    • Marked as answer by KeithHa Wednesday, August 13, 2008 11:17 PM
    Tuesday, July 15, 2008 11:29 AM
  • Good morning Paul - I am still waiting for a development response to your question about character encoding in MS-TDS.

    Thank you for your patience.

    Regards,
    Bill Wesse

    Escalation Engineer
    Friday, July 18, 2008 10:31 AM
  • Good morning again Paul - I am still waiting for a development response to your question about character encoding in MS-TDS.

    Thank you for your patience.

    Regards,
    Bill Wesse

    Escalation Engineer
    Friday, July 18, 2008 10:32 AM
  • Good morning Paul; we are currently awaiting approval for changes to the [MS-TDS] document to clarify the usage of Unicode. Thanks for your patience.

    Regards,
    Bill Wesse

    Escalation Engineer
    Tuesday, July 29, 2008 10:09 AM
  •  Good morning Paul. Thanks for your patience. I now have an answer for your question '[MS-TDS] Section 2.2: What does "Unicode" mean in this context?'. We have modified the [MS-TDS] Glossary (1.1) to include a definition for Unicode (in this case, UCS-2).

    [MS-TDS]

    1.1 Glossary
    The following terms are defined in [MS-GLOS]:
    ...
    Added:
    Unicode:  The set of characters as defined by [UNICODE] that are encoded in UCS-2.
    ...

    Regards,
    Bill Wesse


    Escalation Engineer
    Thursday, July 31, 2008 9:39 AM
  • Bill

    Thanks for the answer - that seems clear enough: "Unicode" means "UCS-2".
    Thursday, July 31, 2008 3:09 PM