none
storing UTF-8 data in VARCHAR or NVARCHAR?

    Question

  • I am able to successfully store and retrieve UTF-8 Chinese characters in a SQL Server 2008 database using the PHP driver, version 1.1 (SQLNCLI10.DLL).

    According to the PHP driver documentation, using NVARCHAR and "CharacterSet" => "UTF-8" is required for UTF-8.  However, I am using neither of these.  I am using VARCHAR fields which require half of the disk space.

    When I query the database using MSSQL Management Studio, the text looks like junk.  However, my webpages display the Chinese characters stored in the database correctly as long as I specifiy <meta content="text/html; charset=utf-8" http-equiv="Content-Type" /> in the header section.

    Is the documentation incorrect? Or is what I am doing a hack which will cause problems for me later?  Thanks.



      

     

    Thursday, November 19, 2009 8:59 AM

Answers

  • The documentation is correct. You need to use NCHAR or NVARCHAR.

     

    The “N” data types (NCHAR and NVARCHAR) are Unicode types and the driver converts UTF-8 data to UTF-16 (and vice versa), UTF-16 being the encoding expected by the SQL Server. For CHAR and VARCHAR types, SQL Server expects the data encoded as UCS-2. Therefore, passing data encoded as UTF-16 is not matching the SQL Server expectations and, subsequently, you’ll end up with corrupted information in the database.

     

    The web pages display correctly because you are making the same mistake twice. Data uploaded is converted from UTF-8 to UTF-16, but wrongly stored as UCS-2, while data retrieved from SQL Server is wrongly assumed to be UTF-16 (while it is actually UCS-2) and converted to UTF-8. In summary, your web page might look ok, but the data stored in the server is corrupted.

     

    Hope this helps,

    Serban

     


    Serban Iliescu - MSFT
    Thursday, November 19, 2009 9:54 PM
    Answerer

All replies

  • The documentation is correct. You need to use NCHAR or NVARCHAR.

     

    The “N” data types (NCHAR and NVARCHAR) are Unicode types and the driver converts UTF-8 data to UTF-16 (and vice versa), UTF-16 being the encoding expected by the SQL Server. For CHAR and VARCHAR types, SQL Server expects the data encoded as UCS-2. Therefore, passing data encoded as UTF-16 is not matching the SQL Server expectations and, subsequently, you’ll end up with corrupted information in the database.

     

    The web pages display correctly because you are making the same mistake twice. Data uploaded is converted from UTF-8 to UTF-16, but wrongly stored as UCS-2, while data retrieved from SQL Server is wrongly assumed to be UTF-16 (while it is actually UCS-2) and converted to UTF-8. In summary, your web page might look ok, but the data stored in the server is corrupted.

     

    Hope this helps,

    Serban

     


    Serban Iliescu - MSFT
    Thursday, November 19, 2009 9:54 PM
    Answerer
  • Thanks for the detailed explanation Serban.  I tried to store UTF-8 properly in the database using NVARCHAR and "CharacterSet" => "UTF-8" but the Chinese characters still look like junk.  I also tried adding calls to iconv but that resulted in this error:

    -46: An error occurred translating the query string to UTF-16: No mapping for the Unicode character exists in the target multi-byte code page.

    Note I am trying to store Chinese text which uses 3 bytes per character in UTF-8. Is this supported with UTF-16 and PHP driver, v1.1? It seems like this could be the problem since I think UTF-16/NCHAR is limited to 2 bytes per character.


    I think I am going to go with my original VARCHAR hack solution because:

    1. The majority of the data is English which is stored correctly.
    2. My app currently only needs to display non-English characters which it does using this approach.
    3. VARCHAR will use half the disk space and result in performance benefits.

    Ideally I would like Microsoft to add a UTF-8 datatype to SQL Server so there is no storage penalty for storing mostly English characters with some multibyte characters mixed in. 
    Friday, November 20, 2009 7:11 AM
  • Internally the driver converts UTF-8 data to UTF-16 by employing Win32 API MultiByteToWideChar (while for the reverse conversion it uses WideCharToMultiByte).

     

    MultiByteToWideChar is called with MB_ERR_INVALID_CHARS flag, and the error you see is actually due to the failure of that API. It might be interesting to know what Chinese characters are causing the Win32 API to fail. Please post the failing UTF-8 data packed as binary string (i.e. via PHP pack()  function).

    Thank you,
    Serban


    Serban Iliescu - MSFT
    Friday, November 20, 2009 10:40 PM
    Answerer
  • Please post the failing UTF-8 data packed as binary string (i.e. via PHP pack()  function).

    Serban Iliescu - MSFT

    The error occurred when trying to insert the text in these tweets:

    http://twitter.com/kaifulee/status/4695878440
    http://twitter.com/kaifulee/status/4695992328

    I already rolled back all of my UTF-8 changes.  Hopefully you will be able to construct the binary strings yourself.  Thanks for your help.

    Saturday, November 21, 2009 10:56 AM
  • Thanks.  We will look at these posts and proceed from there.

    Jay

    Jay Kint MSFT
    Thursday, December 03, 2009 7:31 PM
    Moderator
  • Did try to change the font to view unicode/utf-8 chars in vs management studio?

    If not try this:

    1. Go to tools->options->Environment->Fonts and colors

    2. then change the font to Aerial Ms Unicode for Grid Results/Text Results/Text Editor

    ~Siva
    Wednesday, February 24, 2010 3:40 PM
  • >>The web pages display correctly because you are making the same mistake twice.

    if this is the case, is there any way to convert the UCS-2 text to UTF-16 ? at the database level? (not on the web page)

    Thursday, January 12, 2012 12:11 PM
  • @ ShajuKJ:

    I might be wrong but I thought that Microsoft's implementation of UCS-2 was synonymous with UTF-16, i.e. all of the characters from UCS-2 map to the same ones in UTF-16.

    UCS-2 was Windows' Unicode codepage.  It has been superseded by UTF-16.

    Serban's answer looks wrong to me - the varchar type is single-byte - *not* UCS-2.  Varchar depends on the SQL Server's locale settings (either the default locale or the column's locale) for collation purposes.  I don't think any character mapping is done when the varchar data contains unknown characters in the suggested character set.

    The answer to your question is that anything that was UCS-2 is also compatible with UTF-16.

    If you believe UCS-2 means VARCHAR, and you want to convert a varchar column to nvarchar:

    1. Create a new table exactly like the original, with NVARCHAR instead of VARCHAR columns
    2. Select data from the old table into the new table, e.g.:
        insert int newtable(a,b,c,d,...)
        select a,b,c,d, ... from oldtable
    3. Drop the old table
    4. Rename the new table to the old table using sp_rename.

    Rob
    • Edited by Robert Johnson Friday, January 13, 2012 2:14 PM Edited to show this is a response to ShajuKJ.
    Thursday, January 12, 2012 3:31 PM