I have a situation where I want to preserve the utf-8 encoding in my sql server 2005 database. Don't ask me why, but the gist of it is that I need to support a legacy application that manually converts the utf-8 encodings to unicode and it doesn't work with nchar/nvarchar columns.
The data is coming from a utf-8 encoded xml file. Here is a sample tag:
<tag>... 500 Å annealed ...</tag>
This is the hex, decimal, and unicode representations of the sample character respectively: 212b, 8491, Å
If I load the data into a varchar(max) column using a bulk insert operation, the utf-8 encoding is preserved. The data appears like this in the varchar column:
... 500 Å annealed ....
My problem arises when I convert this to xml in t-sql. As soon as I convert this to xml, the utf-8 encoding is gone, and it's been replaced with a unicde character:
... 500 Å, annealed ...
Now, if I convert back to varchar, it does this:
... 500 Å, annealed ...
Which is a different character.
Is there anyway to preseve the utf-8 encoding when you convert to an xml datatype in sql server?
I don't believe that this is possible. You can specify the attributes encoding, version, and/or stand alone, but these are stripped off during storage as an xml data type.
Hope this helps
You can convert the xml to VARCHAR(MAX) and specify the utf encoding, but once stored as xml the encoding is explicit.
I have never had a situation in which I needed to do such a function, but I know that explicit encoding can be done in the .NET framework which would leave open the possibility of a CLR function or procedure, but since the result set would stil be displayed from SQL I am unsure if the extra work would be reversed upon presentation.
I will research this to see if it is possible.
If you want to preserve encoding, you should probably consider storing the document as VARBINARY(MAX).
Hi, (late response but might help someone in future)
VARBINARY did not work for me, probably 'coz of the my datasource didn't comply. Heres what worked for me at the SQL end;
1) Store your raw xml data as VARCHAR or TEXT (instead of NVARCHAR or NTEXT) into a variable,
2) Read this variable into the xml data type using utf-8 encoding. Something like:
SET @TempHTMLText = --your raw xml data
DECLARE @XMLDataText XML
SELECT @XMLDataType = '<?xml version="1.0" encoding="utf-8" ?>' + @TempHTMLText