Saving UTF-8 in SQL Server 2005

  • Question

  • User151604073 posted

How can I save UTF-8 strings in SQL Server 2005? I don't want to use nvarchar, because it would take a lot of space and most of my text will be in Latin characters (and it doesn't fully support all Unicode code points anyway, since it uses UCS-2 -- I'm talking about the supplementary planes here, which I know are very rarely used and most Unicode fonts don't support, but which could still be used, at least in the future). I read in this KB article (http://support.microsoft.com/kb/232580) that my only option is to use varbinary. I'm worried, though, about whether this could affect performance, as I have never used varbinary before. I also read in the same article that I cannot use varchar to store UTF-8, which I can't really understand: doesn't varchar use all 256 byte values to store character data according to the code page used? What makes UTF-8 bytes invalid for storage in varchar columns? (What I mean here is that I can do the conversion to and from UTF-8 manually; I only want to save the encoded bytes in varchar columns.)

     Any suggestions are highly appreciated..

    Wednesday, January 2, 2008 6:42 AM


All replies

  • User653228039 posted

I'm pretty new to Unicode as well, but from my earlier research, this is the understanding I came away with:

To be stored properly in SQL Server, Unicode text (such as text that arrives as UTF-8) requires nvarchar. Another, much less desirable option is to store the HTML-encoded form in varchar. The reason for this requirement is that Unicode covers East Asian characters as well as many others (that's a lot of characters). What nvarchar does is double the number of bytes set aside for each character, which is necessary when you need to reference characters with very large code points. I believe varchar uses 1 byte per character (8 bits, or a set of 256 characters) and nvarchar uses 2 bytes per character (16 bits, or a set of over 65,000 characters). The downside of nvarchar is that you can only store half as much text as a varchar column of the same byte size.
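    A quick way to see those per-character sizes (a minimal sketch; DATALENGTH returns the size of its argument in bytes):

    -- varchar: 1 byte per character; nvarchar: 2 bytes per character
    SELECT DATALENGTH('abc')  AS varchar_bytes,   -- 3
           DATALENGTH(N'abc') AS nvarchar_bytes;  -- 6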

    Wednesday, January 2, 2008 12:14 PM
  • User151604073 posted

Hi, thanks for your reply, but actually I have read extensively about Unicode recently, so I mostly know how it works. After reading for a while I finally decided that I wanted to use UTF-8 for storing my text. The problem, though, is that SQL Server 2005 (and also SQL Server 2008) doesn't support UTF-8, unlike other databases such as Oracle and Sybase, which support UTF-8 right out of the box (varchar2 in Oracle and varchar in Sybase, with the UTF-8 encoding). As I said in my original post, most of the text is Latin, so it would be a huge waste of space to use nvarchar. The other problem is that nvarchar doesn't support the whole range of Unicode, as it can only store 65,536 values. This is a problem for a language like Chinese, as the Chinese government recently decided to use code points outside the BMP, and even more characters for other languages like Japanese could be added later, which complicates the problem even more (there's a problem in Japan because of Han unification in Unicode). So I'm basically trying to do it once and forget about the headache in the future. By the way, you can find extensive information about Unicode on wikipedia.org, which is where I got most of this information.

    Wednesday, January 2, 2008 4:02 PM
  • User-1225738063 posted

    I would suggest using nvarchar.

While you are correct that many of the text/string-handling functions within SQL Server do not support the entire Unicode set, it can and does store and retrieve UTF-16 byte values without change. As you may be aware, UTF-16 is capable of representing the entire Unicode set, including characters outside the BMP (same as UTF-8). Given that the alternative would be to use something like a UTF-8 encoded string stuffed into a varbinary field, the limits on what you can do with a string stored in an nvarchar field are rather small.

I do believe that the SQL Server code base has been heading toward full UTF-16 support throughout, but it just wasn't completed before SQL Server 2005 shipped. I expect future versions to implement full support. By using nvarchar now, you should be able to switch seamlessly should you need features (sorting/indexing/comparison) that wouldn't be possible on a UTF-8 encoded string stored in a varbinary column.

    Wednesday, January 2, 2008 8:18 PM
  • User151604073 posted

Thanks a lot for your reply. The problem is that I checked the beta documentation of SQL Server 2008 on Microsoft's website, and unfortunately nvarchar is still UCS-2, so as far as Unicode support goes, I see they have done nothing in 2008! (And they also didn't add any other type that fully supports Unicode, so we basically still have the same options as in 2005.) It's not that I'm worried about not having characters outside the BMP; I know that this is not a big problem, as it's a rare case. What I'm actually worried about is that most of the text in my application is in Latin characters, so it's a huge waste of space to use nvarchar. You're absolutely correct that I can't use SQL Server functions with UTF-8 strings stored in varbinary, but there's one thing I should have mentioned in my original post (and I'm sorry for that): I only want to display the text in the browser. Just read from the database and display, or save from the browser to the database. No searching, no sorting... nothing.

As I mentioned in my original post, I read in the KB article referred to above that I can use varbinary to save UTF-8 strings in SQL Server, but as I have never used varbinary before I'm a little concerned about the performance. Does it have any implications for performance? The other thing, which is just out of curiosity, is that the KB article says you should never store UTF-8 text in varchar, which I just don't understand. Is it because you can't use SQL Server string functions with it (which I won't use anyway, as I mentioned before), or is there anything else that makes it invalid? I understand that the characters returned from SQL Server will have to be converted to UTF-8, but I believe this can be done in .NET or by a function in SQL Server itself, so I wonder what exactly makes it impossible?

    Thursday, January 3, 2008 9:08 AM
  • User-1225738063 posted

Yes, varchar isn't capable of reliably holding all the byte sequences that a UTF-8 string requires: varchar data is interpreted and converted according to the column's code page, so bytes that aren't valid in that code page can be altered along the way. Trying to store UTF-8 data in a varchar field can result in data loss.
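    A minimal sketch of that kind of loss (assuming a Latin-1 default collation such as SQL_Latin1_General_CP1_CI_AS; characters with no mapping in the code page come back as '?'):

    -- U+4E2D (a Chinese character) has no representation in code page 1252,
    -- so the conversion to varchar silently replaces it
    SELECT CAST(NCHAR(0x4E2D) AS varchar(10));  -- returns '?'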

Saying that nvarchar is UCS-2 is a bit misleading. Nvarchar is a storage format, and it is capable of storing UTF-16 just fine. It's the functions that act upon strings that don't support the surrogates in UTF-16. A minor difference, but an important one. If you aren't doing sorting, indexing, or comparisons on UTF-16 data, then nvarchar is just fine.

    Secondly, "they didn't also add any other types that does fully support Unicode".  It does FULLY support unicode 1.1.  You are looking for features added after version 1.1 that involve multiple code points to character conversions (Surrogates).  From what I have read, there have been many improvements in even SQL Server 2005 for supporting surrogates in SQL Server, but not all the functions were fully upgraded, so trying to use UTF-16 should be avoided at this time if possible.  It's an unsupported feature.  I suspect the beta documentation that would make such a claim would significantly lag behind the code that actually does it.  As it affects many many different functions, it may not make it into the documentation until release to manufacturing even if support is fully added.  Since the bulk of the problem is surrogate handling, I wouldn't be suprised if UTF-8 support was added at the same time.  I am suprised that UTF-32 which handles all of the unicode code space without requiring the use of surrogates hasn't been implemented yet.

Also, I'm sure you are aware of the tradeoffs of UTF-8 vs. UCS-2 (and UTF-16 vs. UCS-2, for that matter):

UTF-8 requires much more processing than UCS-2 because entire strings must be scanned for multi-byte sequences before answering many common questions, such as: how many characters are in this string? Operations like instr have added complexity because you must check whether a given byte offset happens to fall in the middle of a multi-byte sequence, and so on.

To guarantee that you can hold x characters in UTF-8 format, you must allocate four times that number of bytes, whereas UCS-2 always requires exactly two bytes per character. This adversely affects page size and the maximum characters per row, and increases (doubles) the memory requirements for space allocation, etc.

There are many other reasons as well, but these alone are enough to make UTF-8 a poor choice as a standard database encoding format. In fact, UTF-8 is just about the opposite of database friendly: where databases trade disk space for speed, UTF-8 does the reverse, trading processing complexity for an average size reduction. Really, I can't think of any reason why I would want to store data in UTF-8 format within a database. I would like to see UTF-32 support, though.

     

Back to your initial question, however:

Varbinary would be appropriate, because it is exactly what you are looking for: you want to store a variable number of bytes in a field and have the database assume nothing about the data. That is exactly what varbinary is for. Just make sure you define the varbinary field with a maximum size of 4 times the number of characters you want to be able to hold. It should perform the same as an nvarchar field with the same maximum number of bytes.
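    For example, a minimal sketch of that sizing rule (the table and column names are just placeholders):

    -- Worst case 4 bytes per character in UTF-8, so 4000 bytes buys 1000 characters;
    -- the nvarchar equivalent would be nvarchar(1000) = 2000 bytes
    CREATE TABLE Articles
    (
        Id   int IDENTITY PRIMARY KEY,
        Body varbinary(4000) NOT NULL  -- UTF-8 bytes, encoded/decoded by the application
    );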

• Marked as answer
    Sunday, January 6, 2008 9:47 PM
  • User151604073 posted

    Thanks, but actually I don't agree with you on some points.

You're saying it's misleading to say that nvarchar is UCS-2, but that is actually what it is, and it's what's written in the documentation. It's true that you can save UTF-16 characters in an nvarchar column, but that doesn't make it UTF-16. For example, if you try to store a 20-character UTF-16 string that contains some surrogate pairs in an nvarchar(20) column, the string will be truncated and some of the data will be lost. You can only do this indirectly: you have to check whether the string contains surrogates when saving it, to make sure it won't be truncated, and when reading from the database you have to convert the string from UCS-2 to UTF-16 (it won't be converted automatically). All of this means an nvarchar column is not UTF-16.
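    For illustration, a minimal sketch of that truncation, using NCHAR to build the supplementary character U+1D11E from its surrogate pair:

    DECLARE @one nvarchar(1);
    -- U+1D11E takes two UTF-16 code units, but nvarchar(1) holds only one
    SET @one = NCHAR(0xD834) + NCHAR(0xDD1E);
    SELECT DATALENGTH(@one);  -- 2 bytes: a lone high surrogate, the character is lost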

Also, you should never force the customer to use whatever column type available. You should always have the choice: what if I want to use UTF-8 and I'm aware of all the problems it involves? I should never be forced to use other column types just because what I want is not available. For example, in my case I only want to save the data in UTF-8 to save space, and I'm not going to use any string function or collation or anything else, so why can't I have that?

Back to my original post: I don't even want to reserve 4 bytes for every character. I'm going to save as many characters as the varbinary can hold. E.g., a varbinary(5000) can hold a maximum of 5000 Latin characters, but if another language is used, I'll save as many characters as the column can hold and reject the excess.

    Thanks again...
     

    Monday, January 7, 2008 8:34 AM
  • User-1225738063 posted

    You are always free to not agree with me.

I say it's misleading because you can store UTF-16 data in an nvarchar column with no conversion. If you define your nvarchar too small, then you might possibly truncate your string; that, however, is a documentation issue. The point I am making is that nvarchar itself defines a data type for storage, and SQL Server can store UTF-16 in nvarchar without conversion: you set a field to a UTF-16 string and retrieve it without any conversion at all.
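    For example, a minimal sketch of that lossless round trip when the variable or column is sized adequately:

    DECLARE @s nvarchar(10);
    SET @s = NCHAR(0xD834) + NCHAR(0xDD1E);  -- U+1D11E as a surrogate pair
    -- The bytes are preserved exactly; only the string functions see two "characters"
    SELECT DATALENGTH(@s) AS bytes, LEN(@s) AS code_units;  -- 4 bytes, LEN = 2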

I am guessing that English isn't your primary language, and while you speak it fairly well, "Also, you should never force the customer to use whatever column type available." doesn't translate very well, but I think I get your meaning.

    You can create your own UTF8 datatype if you want:

EXEC sp_addtype 'tinyUTF8', 'varbinary(255)', 'NULL'

    EXEC sp_addtype 'smallUTF8', 'varbinary(8000)', 'NULL'

    EXEC sp_addtype 'UTF8', 'varbinary(max)', 'NULL'
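    Assuming the alias types above are created (8000 bytes is the largest fixed varbinary length, hence smallUTF8), a table can then use them directly; a minimal sketch with a placeholder table name:

    CREATE TABLE Messages
    (
        Id   int IDENTITY PRIMARY KEY,
        Body UTF8 NOT NULL  -- UTF-8 bytes; the application does the encoding/decoding
    );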

    If that isn't close enough, you can create your own datatype using CLR.

     

    Wednesday, January 9, 2008 3:28 PM
  • User1222818074 posted

SQL Server 2005 has the built-in N prefix for Unicode string literals. This is a simple way to work with Unicode in server-side programming.
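    For example (a minimal sketch; without the N prefix, the literal is first interpreted as varchar and squeezed through the code page):

    -- The N prefix makes the literal nvarchar, so Unicode survives
    SELECT N'中文' AS with_prefix,    -- stored as Unicode
           '中文'  AS without_prefix; -- code-page converted, typically comes back as '??'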

    Friday, January 16, 2009 2:50 AM