Unicode vs ASCII difference and benefits

    Question

  • There is a lot of information about Unicode and ASCII, but for me it is very hard to understand the main ideas.

    I worked at one company on a large business application, and one very experienced developer said: "We don't use Unicode (nvarchar) - we use varchar, because nvarchar takes more disk and memory (1 byte for ASCII and 2 bytes for Unicode), and varchar gives better performance in JOINs and other operations."

    When I had an interview, another experienced developer said: "You shouldn't use nvarchar - it is terrible."

    And very often people ask: "What are the advantages of Unicode? When should you use nvarchar and when varchar?"

    So there are many resources about when we should use nvarchar and when varchar - I understand that clearly, and I don't need any answers about it ( Answer ). I understood that it is better to always use nvarchar, but sometimes varchar can be used too.

    So my main question is: "What is the difference between Unicode and ASCII, and what are Unicode's advantages?"

    I have read a lot of documentation and articles, and I want you to correct me if I am wrong.

    1. Unicode takes 2 bytes: 1 byte for the language page and 1 byte for the sign value.

    2. ASCII takes 1 byte. It doesn't contain info about a language page, and all 8 bits contain sign info.

    3. If we use different languages in our application at the same time - I mean we can see records in English and Japanese - Unicode can solve this problem, because it has language page info.

    4. If we use different languages in our application at the same time - I mean we can see records in English and Japanese - ASCII can't solve this problem, because it can store info about only one language.

    5. If our application uses different languages in different instances, both ASCII and Unicode can solve this problem, because each has 1 byte for storing a sign.

    6. Also, it is usually written that Unicode can store more signs - I mean it can store the Japanese and Chinese alphabets. But I can't understand why it can. Both Unicode and ASCII have 1 byte for storing sign info, so why can Unicode store more signs than ASCII?



    I created my super player Media Glass


    • Edited by SmartWhy Friday, February 17, 2012 9:20 AM
    Friday, February 17, 2012 9:17 AM

Answers

  • Unicode takes 2 bytes on SQL Server, but it's not 1 byte for the language page and one for the sign value. I don't know where you got this piece of information, but it's wrong, and therefore the conclusions that you have drawn from it are also wrong.

    If you don't need to store any language other than English, then stick with varchar. If you do need to, then you can use Unicode, and in many cases you probably won't have any problems. However, the real situation is a little more complicated than that, and using Unicode is not a guarantee that you won't have any problems using many languages with your database; so I suggest that you keep reading in order to prepare yourself for the future, as you never know when you might have to do this one day.

    • Marked as answer by SmartWhy Friday, February 17, 2012 11:07 AM
    Friday, February 17, 2012 10:30 AM
  • There are a number of misconceptions in your post, and I will make an attempt to sort them out.

    1. Unicode takes 2 byte. 1 byte for language page 1 byte for sign value.

    1) There is no "language page" per se in Unicode. Characters from the same charset are close to each other, and a new group typically starts at XX01, but there is no page per se.

    2) Unicode comprises 2^21 characters, and how much space each character takes up depends on the encoding. SQL Server uses UTF-16, which means that characters in the base plane take up two bytes, and the others 4 bytes. (But beware that the latter are all "exotic" characters.) However, when you use compression, SQL Server will store a Unicode string in an internal format, so that a string consisting only of ASCII characters will not take up more space than a varchar string.
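    A quick Python sketch (my illustration, not part of the original answer) shows the UTF-16 behavior described above: base-plane characters encode to 2 bytes, while characters outside the base plane take 4.

```python
# Byte counts under UTF-16 (little-endian, the layout SQL Server uses for nvarchar).
for ch in ["A", "é", "あ", "𠀀"]:   # the last character lies outside the base plane
    print(ch, len(ch.encode("utf-16-le")), "bytes")
```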

    2. ASCII takes 1 byte. It doesn't containt info about language page and all bytes ( 8 ) contain sign info.

    ASCII is a 7-bit character set. A varchar column is tied to a code page. Some code pages have only 255 characters, but there are also code pages for multi-byte character sets, so you can store Chinese text in varchar.
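    As an illustration (a Python sketch of my own, using cp932 as an example Japanese code page), a multi-byte code page stores ASCII letters in one byte each but a Japanese character in two:

```python
# cp932 (the Shift-JIS-based Japanese code page): ASCII stays 1 byte per
# character, while a Japanese character takes 2 bytes.
text = "abcあ"
print(len(text.encode("cp932")))  # 4 characters -> 5 bytes
```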

    3. If we will use in our application different language in one time. I mean we can see record on Engligh and Japan language - Unicode can solve this problem. Because it has language page info.

    Indeed, if you expect that you need to support multiple languages, you should go for nvarchar.

    4. If we will use in our application different language in one time. I mean we can see record on Engligh and Japan language - ASCII can't solve this problem, because it can store info only about one language.

    Actually, English and Japanese can coexist in varchar if you use a Japanese collation. But you would not be able to also handle, say, Swedish text correctly. Overall, it would be an inferior solution, but it would be possible.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
    • Marked as answer by SmartWhy Friday, February 17, 2012 11:43 AM
    Friday, February 17, 2012 11:38 AM
  • Hi,

    Some corrections from my side:

    There are multiple Unicode encodings. Common encodings are UTF-8, UTF-16 and UTF-32. The number gives the MINIMUM number of bits used for a character. So in UTF-8 each character takes at least 1 byte, and UTF-32 uses at least 4 bytes. But depending on the character, the encoding can take up to 6 bytes (AFAIK).
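    To make the size differences concrete, here is a small Python sketch (my own illustration) encoding the same character under the three encodings:

```python
# One Japanese character (あ, U+3042) under the three common Unicode encodings.
ch = "あ"
print(len(ch.encode("utf-8")))      # 3 bytes
print(len(ch.encode("utf-16-le")))  # 2 bytes
print(len(ch.encode("utf-32-le")))  # 4 bytes
```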

    ASCII uses a fixed table of 256 characters. Some are always the same, and some can vary with the code page. The code page is not encoded inside the ASCII code. (So if you switch code pages, you get a problem with some characters!)

    Whenever you store text that could contain special characters, you should ALWAYS use Unicode. Even without considering Chinese characters, you could run into problems with special characters that different languages use, e.g. German äöüß or characters like á and à from other languages ...

    So the main difference between ASCII and Unicode is:
    In ASCII, each byte is a character - so you can only define 256 characters.
    In Unicode, bytes have a special meaning. So with UTF-8, you read a byte and, depending on its content, take more bytes into account, too.
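    That lead-byte logic can be sketched in Python (a simplified illustration of mine, ignoring invalid input):

```python
def utf8_seq_len(lead: int) -> int:
    """Length in bytes of a UTF-8 sequence, judged from its first byte.
    (Bytes 0x80-0xBF are continuation bytes and never start a sequence.)"""
    if lead < 0x80:
        return 1  # 0xxxxxxx: plain ASCII
    if lead < 0xE0:
        return 2  # 110xxxxx
    if lead < 0xF0:
        return 3  # 1110xxxx
    return 4      # 11110xxx

print(utf8_seq_len("é".encode("utf-8")[0]))   # 2
print(utf8_seq_len("あ".encode("utf-8")[0]))  # 3
```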

    Wikipedia has very nice explanations of the Unicode encodings. Maybe you want to have a closer look at these pages?
    http://en.wikipedia.org/wiki/Unicode
    http://en.wikipedia.org/wiki/UTF8
    http://en.wikipedia.org/wiki/UTF16
    http://en.wikipedia.org/wiki/UTF32

    With kind regards,

    Konrad

    • Marked as answer by SmartWhy Friday, February 17, 2012 11:07 AM
    Friday, February 17, 2012 10:38 AM
  • You have some information wrong, here.

    Check this link, there is a lot to learn !! www.unicode.org

    • Marked as answer by SmartWhy Friday, February 17, 2012 11:07 AM
    Friday, February 17, 2012 10:41 AM

All replies

  • As you have learnt, an ASCII char uses 1 byte and a Unicode nchar uses 2 bytes.

    Let's consider a scenario,

    Assume you have a table with only one column which uses an ASCII char value. Let's assume there are trillions of rows, making the total size of the table equal to 1 GB. (Ignore the mathematical calculation here. Just assume.)

    Now this table occupies 1 GB of space on your disk, and when the data is in the buffer pool (memory), it will again use 1 GB; and to pull the table from disk to memory, your IO does 1 GB of traffic.

    In this same scenario, if you have an nchar column, as you know, you will be using 2 bytes instead of the 1 byte of ASCII.

    So you will have a table sized 2 GB, need 2 GB of memory, and have IO traffic of 2 GB.

    Put it in money terms: you have just doubled your costs.

    Now every time transactions are done, data is passed to and fro by SQL Server, and your IO carries that traffic.

    Bluntly speaking, if a transaction took 1 second to complete a task with ASCII, you will take 2 seconds to complete the same task with Unicode. Similarly, if a set of tasks is completed in 12 hours, it will take a day to complete with Unicode.
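    The raw doubling for ASCII-only text is easy to verify with a Python sketch (my illustration of the encoding sizes only, not a database benchmark - later replies in this thread dispute the performance conclusion):

```python
# The same ASCII-only value in a single-byte encoding vs. UTF-16.
row = "CUSTOMER12345678"
print(len(row.encode("ascii")))      # 16 bytes
print(len(row.encode("utf-16-le")))  # 32 bytes: exactly double
```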

    Resources and time are money. To save them, you have to make the best choice - here, ASCII.

    This is the way things are looked at in the market. Your analysis is fine, but it does not fit the real-world scenario as well.


    Planet Earth is at risk. Global warming is on a high tide.
    Take Responsibility. Plant Trees. Keep your City Clean and Green.

    Mark all Helping Posts and Close your Threads. Keep the Forum Green.
     - Arun Kumar Allu

    Friday, February 17, 2012 10:28 AM
  • Don't take my calculations as-is. A lot of factors affect the scenario I have talked about.

    The post is just about presenting a view of looking at things.

    Technical reasons are not always the solution.

    Hope this makes a satisfactory explanation.


    Planet Earth is at risk. Global warming is on a high tide.
    Take Responsibility. Plant Trees. Keep your City Clean and Green.

    Mark all Helping Posts and Close your Threads. Keep the Forum Green.
     - Arun Kumar Allu

    Friday, February 17, 2012 10:30 AM
  • Hi,

    I am just wondering. SQL Server is not using 2 bytes per character. It uses UTF-16, which means that it reads the data in chunks of 2 bytes, but it does not mean that each character uses just 1 chunk. Some characters require 2 chunks, so one character can in fact use 4 bytes instead of 2.

    http://en.wikipedia.org/wiki/UTF16 - at least I understood UTF-16 that way, and I think that SQL Server sticks to UTF-16.

    With kind regards,

    Konrad

    Friday, February 17, 2012 10:42 AM
  • Not exactly. In a database, a lot of things are not stored as strings of characters - numbers, for example - and most joins are usually done on integers, with filtering on integers and date/time values. Expect an increase of about 30-35% in size when switching from ASCII to Unicode, and a decrease in performance of about 1 to 10% at most. So, unless your database is already running in a critical state, it should be impossible for your users to notice any sensible difference, with some rare exceptions here and there where the decrease in performance could reach 20 or 30%.

    The classic example would be searching a lengthy table with the LIKE '%string%' operator, where the decrease in performance will typically be around 20%, if I remember correctly.

    While there are exceptions like Google - which anyway doesn't have any choice in this - for most people there will be no difference in terms of cost and time when using either ASCII or Unicode, because they will have some room to play with: their CPUs are not running at 100% most of the time, and their hard drives are not filled to 100% either.

    Friday, February 17, 2012 10:53 AM
  • Not exactly: SQL Server doesn't use UTF-16 but UCS-2, which is a subset of UTF-16. In UCS-2, all available characters are always 2 bytes, never 4.

    This doesn't mean that you cannot store UTF-16 in SQL Server, but if you do, some operations like sorting, filtering, searching or even the equality operator might yield incorrect results with regard to UTF-16.
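    The surrogate mechanism can be seen in this Python sketch (my illustration): a character outside the BMP becomes two 2-byte code units, which UCS-2-style processing treats as two separate characters.

```python
import struct

# U+20000 encodes to a surrogate pair in UTF-16.
units = struct.unpack("<2H", "𠀀".encode("utf-16-le"))
print([hex(u) for u in units])  # ['0xd840', '0xdc00']
```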

    Friday, February 17, 2012 11:26 AM
  • Ahh ... great! Thank you for the explanation! I learned something new. (Somehow I thought that UTF-16 was used, but that explanation makes sense!)

    With kind regards,

    Konrad

    Friday, February 17, 2012 11:58 AM
  • Bluntly speaking, if a transaction took 1 second to complete a task with ASCII, you will take 2 seconds to complete the same task with Unicode. Similarly, if a set of tasks is completed in 12 hours, it will take a day to complete with Unicode.

    This is by no means necessarily true. If you have a Windows collation, operations are typically somewhat faster with nvarchar, because all operations are performed in Unicode, so with varchar you pay the cost of converting to Unicode and back.

    If you have an SQL collation, it is different, because an SQL collation in varchar only has 255 characters and can be quicker. Sometimes a lot quicker.

    As I pointed out in my other post, it is also not necessarily true that your disk space doubles, since SQL 2008 R2 has Unicode compression. It is also important to understand that a database holds more than character data, and also: you should not make all columns nvarchar. A column which is to hold a currency code should be char(3) and nothing else.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
    Friday, February 17, 2012 11:28 PM
  • Not exactly: SQL Server doesn't use UTF-16 but UCS-2, which is a subset of UTF-16. In UCS-2, all available characters are always 2 bytes, never 4.

    This doesn't mean that you cannot store UTF-16 in SQL Server, but if you do, some operations like sorting, filtering, searching or even the equality operator might yield incorrect results with regard to UTF-16.

    SQL 2012 adds a new suite of collations, all ending with _SC in the name. These collations have full support for UTF-16 and will handle so-called surrogates correctly. Note that nvarchar(50) does not mean 50 characters - it means 50 double-byte units. Thus, you can only store 25 characters that require 4 bytes in nvarchar(50).
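    A Python sketch (my illustration) of counting double-byte units rather than characters:

```python
# nvarchar(n) counts UTF-16 double-byte units, not characters.
s = "𠀀" * 25  # 25 characters, each needing a surrogate pair
units = len(s.encode("utf-16-le")) // 2
print(units)   # 50 units - these 25 characters would fill nvarchar(50)
```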


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
    Friday, February 17, 2012 11:31 PM
  • The classic example would be searching a lengthy table with the LIKE '%string%' operator, where the decrease in performance will typically be around 20%, if I remember correctly.

    If you have an SQL collation, the difference is about a factor of 7 for this example. For a Windows collation I don't remember, but maybe not much at all, as the same rules are applied. But without compression, the fact that you need to read more pages may have a negative effect. (I have data on this for all collations somewhere, but I don't feel like digging it up right now.)


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
    Friday, February 17, 2012 11:33 PM
  • ASCII uses a fixed table of 256 characters.

    95 characters, not 256.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
    Friday, February 17, 2012 11:35 PM