none
preserve utf-8 encoding in sql server when converting to xml datatype

    Question

  • Hi,

    I have a situation where I want to preserve the utf-8 encoding in my sql server 2005 database. Don't ask me why, but the gist of it is that I need to support a legacy application that manually converts the utf-8 encodings to unicode and it doesn't work with nchar/nvarchar columns.

     The data is coming from a utf-8 encoded xml file. Here is a sample tag:

     <tag>... 500 &#x212b; annealed ...</tag>

    This is the hex, decimal, and unicode representations of the sample character respectively: 212b, 8491, Å

    If I load the data into a varchar(max) column using a bulk insert operation, the utf-8 encoding is preserved. The data appears like this in the varchar column:

    ... 500 &#x212b; annealed ....

    My problem arises when I convert this to xml in t-sql. As soon as I convert this to xml, the utf-8 encoding is gone, and it's been replaced with a unicde character:

    ...  500 Å, annealed ...

    Now, if I convert back to varchar, it does this:

    ...  500 Å, annealed ...

    Which is a different character.

    Is there anyway to preseve the utf-8 encoding when you convert to an xml datatype in sql server?

     Thanks!

    Wednesday, January 28, 2009 7:07 PM

Answers

All replies

  • I don't believe that this is possible.  You can specify the attributes encoding, version, and/or stand alone, but these are stripped off during storage as an xml data type.

    http://technet.microsoft.com/en-us/library/ms187107(SQL.90).aspx

    http://msdn.microsoft.com/en-us/library/ms345117.aspx

     

    Hope this helps

     


    David Dye
    Wednesday, January 28, 2009 7:15 PM
    Moderator
  • Thanks for your reply. That's what I was afraid of.

    Are there any functions to take unicode characters and return the utf-8 encoding in sql server?

    For example,

    select dbo.encodeutf8('Å')

    Would return

    -------------

    0x212b

    Wednesday, January 28, 2009 7:29 PM
  • You can convert the xml to VARCHAR(MAX) and specify the utf encoding, but once stored as xml the encoding is explicit.

    I have never had a situation in which I needed to do such a function, but I know that explicit encoding can be done in the .NET framework which would leave open the possibility of a CLR function or procedure, but since the result set would stil be displayed from SQL I am unsure if the extra work would be reversed upon presentation.

     

    I will research this to see if it is possible.

     


    David Dye
    Wednesday, January 28, 2009 8:02 PM
    Moderator
  • Thanks for your help.

    How to you convret xml to varchar and specify the utf encoding in sql server?

    Wednesday, January 28, 2009 8:13 PM
  • If you want to preserve encoding, you should probably consider storing the document as VARBINARY(MAX). 
    http://jacobsebastian.blogspot.com/
    Friday, January 30, 2009 12:35 PM
    Moderator
  • I have exhausted all options and ideas that I had.  I even attempted converting a sql xml result set to utf-8 using a clr function and was unable to do this.  I have to concur with Jacob's reccommendation above.

    Hope this hepls

    David Dye
    Sunday, February 01, 2009 6:41 PM
    Moderator
  • Hi, (late response but might help someone in future)
    VARBINARY did not work for me, probably 'coz of the my datasource didn't comply. Heres what worked for me at the SQL end;

    1) Store your raw xml data as VARCHAR or TEXT (instead of NVARCHAR or NTEXT) into a variable,

    2) Read this variable into the xml data type using utf-8 encoding. Something like:

        DECLARE @TempHTMLText

        SET @TempHTMLText = --your raw xml data
        DECLARE @XMLDataText XML

        SELECT @XMLDataType = '<?xml version="1.0" encoding="utf-8" ?>' + @TempHTMLText

     

    Thursday, April 28, 2011 7:26 AM