none
SQL Instance Collation - Language Neutral Required RRS feed

  • Question

  • SQL'ers

    I am not new to SQL but have run into a question about Language Neutral collation.

    We are building an instance for SQL Server 2008 R2 and the main purpose is for an application that is stating that the DB ( and it would be best at the instance level as well) should be built with a  case-insensitive, accent-insensitive, and language neutral collation.  We normally use SQL_Latin1_General_CP1_CI_AS so that would not totally work. The proprietry application indicates 'SQL_Latin1_General_CP437_CI_AI' as an example. Obviously that is case- insensitive and accent-insensitive. What is puzzling me is "language neutraul" - never had to deal with that one. So, is it safe to say that the common collation that many use 'SQL_Latin1_General_CP1_CI_AS' is NOT language neurtral? 

    Thanks MG

    Tuesday, September 16, 2014 6:30 PM

Answers

  • Hi there. I wanted to clarify a few things that have been said here:

    We normally use SQL_Latin1_General_CP1_CI_AS so that would not totally work.

    Actually, in this specific context (i.e. this particular application) it would have worked. You have been told that it won't, but yes, it would have. I will explain more in a moment.

    The proprietary application indicates SQL_Latin1_General_CP437_CI_AI as an example.

    Actually, no, it is not an example; it is one of two "supported" collations. The "proprietary application" is Atlassian Jira. According to their "Connecting Jira applications to SQL Server 2012" guide:

    "We support SQL_Latin1_General_CP437_CI_AI and Latin1_General_CI_AI as case-insensitive, accent-insensitive, and language neutral collation types."

    You can ignore the word "types". These two collations are listed in other places as the only two supported collations.

    What is puzzling me is "language neutral" - never had to deal with that one. So, is it safe to say that the common collation that many use SQL_Latin1_General_CP1_CI_AS is NOT language neurtral?

    Correct, that collation, and most others, are not "language neutral". There are three aspects of collations that are affected by the culture (i.e. Latin1_General vs French vs Hebrew vs Syriac, and so on):

    1. The code page used for the VARCHAR / CHAR / TEXT character set / encoding
    2. The comparison and sorting rules
    3. uppercase / lowercase mappings (though this only impacts 2 characters — dotted and dotless "i"/"I" — and for only 2 cultures — Azeri and Turkish)

    Language Neutral Collations?

    There are no collations that are fully language neutral like the "invariant" culture in .NET. However, there are some collations that are effectively language neutral: binary collations. All binary collations are language neutral in terms of comparison and sorting rules. They have no linguistic rules at all. Their code page still controls the character set for VARCHAR data, but for NVARCHAR data that is a non-issue. So technically speaking, binary collations, when used with NVARCHAR data, are language neutral (not 100% due to the uppercase/lowercase mapping issue for the Turkish/Azeri dotted and dotless "i"s, but as far as I know that is the only exception, so 99.99% is good enough, right? ;-).

    For more details on these distinctions, please see: 

    Differences Between the Various Binary Collations (Cultures, Versions, and BIN vs BIN2)

    BUT, just because there are collations that are effectively language neutral does not mean that any of them will work for this purpose. They won't. They are very much not case insensitive, and case insensitivity is really the only true requirement. The phrase "language neutral", at least in terms of it being used in the Jira documentation, is nonsensical. There is no reason for that phrase to be there outside of simply not understanding how collations (at least in SQL Server) work.

    Code Page 437?

    • Code Page 437 chart
    • Code Page 1252 chart
    • 54 characters shared by both code pages:

      ƒ   ¡ ¢ £ ¥ ª « ¬ ° ± ² µ · º » ¼ ½ ¿ Ä Å Æ Ç É Ñ Ö Ü ß à á â ä å æ ç è é ê ë ì í î ï ñ ò ó ô ö ÷ ù ú û ü ÿ

    • 74 characters unique to code page 437:

      ₧ ⌐ ░ ▒ ▓ │ ┤ ╡ ╢ ╖ ╕ ╣ ║ ╗ ╝ ╜ ╛ ┐ └ ┴ ┬ ├ ─ ┼ ╞ ╟ ╚ ╔ ╩ ╦ ╠ ═ ╬ ╧ ╨ ╤ ╥ ╙ ╘ ╒ ╓ ╫ ╪ ┘ ┌ █ ▄ ▌ ▐ ▀ α Γ π Σ σ τ Φ Θ Ω δ ∞ φ ε ∩ ≡ ≥ ≤ ⌠ ⌡ ≈ ∙ √ ⁿ ■

    • 74 characters unique to code page 1252 (the 5 boxes are undefined characters):

      €  ‚ „ … † ‡ ˆ ‰ Š ‹ Œ  Ž   ‘ ’ “ ” • – — ˜ ™ š › œ  ž Ÿ ¤ ¦ § ¨ © ­ ® ¯ ³ ´ ¶ ¸ ¹ ¾ À Á Â Ã È Ê Ë Ì Í Î Ï Ð Ò Ó Ô Õ × Ø Ù Ú Û Ý Þ ã ð õ ø ý þ

    So, Code Page 437, while not ideal, could work, and it seems to for many people who have, for better or for worse, followed Atlassian's incorrect guidance and used that collation. And as you can see, words like résumé, naïve, and rôle certainly will work, though many others clearly will not.

    Which Collation to Use?

    I recommend using Latin1_General_100_CI_AI. I recommend this because:

    1. The vast majority of all string columns in tables created by Atlassian use NCHAR, NVARCHAR, and NTEXT (yes, this has been deprecated since the release of SQL Server 2005, and is why you can't use a collation ending with "_SC"). Meaning: the datatypes are Unicode and don't even use code pages in the first place.
    2. If the code page did matter, then there are several other cultures that use code page 1252 (the code page used by one of there supported collations: Latin1_General_CI_AI), such as the French_ collations. So wouldn't French_CI_AI also be supported?
    3. They can't be relying on any of the characters that are unique to either code page 437 or code page 1252 as the app wouldn't work since you can't use both collations.
    4. Since they are really only using the Unicode-only types (i.e. the "N"-prefixed types), then the sorting and comparison rules are not only the same between the two officially supported collations, but are also identical to the following collations: SQL_AltDiction_CP850_CI_AI, SQL_Latin1_General_CP1_CI_AI, SQL_Latin1_General_CP1253_CI_AI, and SQL_Latin1_General_CP850_CI_AI. So why aren't those 4 supported?
    5. If Latin1_General_CI_AI is supported, then there's almost no chance that Latin1_General_100_CI_AI (which is a far better choice) isn't also supported. The version 100 collation has about 15,400 more sort weight definitions, plus 438 more uppercase/lowercase mappings. Not having those sort weights means that 15,400 more characters in the non-100 version equate to space, an empty string, and to each other. Not having those case mappings means that 438 more characters in the non-100 version return the character passed in (i.e. no change) for the UPPER() and LOWER() functions. There is no reason at all to want Latin1_General_CI_AI instead of Latin1_General_100_CI_AI. There might be a need if code was put into place to work around these deficiencies, and that code would behave incorrectly under the newer, better version of that collation. However, it's highly unlikely that code was put into place to account for this, and extremely unlikely that if such code did exist, that it would error or doing things incorrectly due to the newer collation.
    6. There is a) no evidence to suggest that other collation wouldn't work just fine. Perhaps not binary collations (and maybe also not case-sensitive?), but any case-insensitive (and optionally accent-insensitive) collation should be just fine. b) there is evidence of several cases of installations working for years under other collations (i.e. case-insensitive collations that are neither of the two supported collations).
    7. Between Latin1_General_CI_AI and Latin1_General_100_CI_AI, the character sets are the same for both VARCHAR (code page 1252) and NVARCHAR (Unicode), so no potential data loss. They are both case and accent INsensitive, so the sorting and comparison behavior is generally the same. I emphasized "generally" because the version 100 variant is simply better / more accurate in terms of sorting and comparison; it can even differentiate between different supplementary characters whereas the "supported" variant will see ALL supplementary characters (emojis and many others) as ALL being the same (in fact, the same nothing as those extra 15,000+ characters in the non-version 100 variant, or extra 115,000+ characters if you throw in the first 100k supplementary characters). Given that the code is written in US English, it won't notice these differences. But, queries against the data you enter (including config data for project names, issue types, story names, descriptions, comments, etc) will notice the difference if you ever use any of the characters with missing sort weights and/or uppercase/lowercase mappings. If you are searching for something, wouldn't you want as many characters as possible to actually match that same character in the data? Characters such as — ǹ ȧ ȼ ѐ ӧ — might not get used much, but if they ever do, you won't be able to filter on them unless you use Latin1_General_100_CI_AI (or even Latin1_General_100_CI_AS).
    8. All information thus far points to this being an issue of Atlassian simply not knowing anything about SQL Server collations (the two supported collations are one indicator, and the whole "language neutral" thing is another). And to be fair, they aren't a Microsoft shop. They produce Java apps that work against several RDBMSs. They're doing ok with SQL Server, but clearly need to fix the NTEXT columns to be NVARCHAR(MAX), and need to update their collation requirements (and some of the code related to temp tables). I suspect that they claim to support only these two collations due to running into problems years ago, not knowing what was happening or how to fix it, but saw that they didn't have problems with these two collations, and so decided to make things easy by just simply requiring one of these two.


    All of that being said, if you are going to stick with one of the two "supported" collations because it was what the vendor stated, even if they have no idea why they stated it, and even if they were unaware of a better option (i.e. what I recommended: Latin1_General_100_CI_AI), then please pick Latin1_General_CI_AI instead of the SQL Server collation SQL_Latin1_General_CP437_CI_AI.

    Anything Else?

    Yes, be sure to use the same default collation at the instance-level and at the database-level. There is at least one bug due to some code joining a table in the Jira DB with a local temporary table on a string column. They did not use COLLATE DATABASE_DEFAULT on the predicate and so get a collation mismatch error:

    Collation error in Jira when the database and server/default collation do not match in SQL Server



    P.S. For more info on working with collations in general (especially within SQL Server), please visit: Collations Info



    Take care,
    Solomon...

    SQL# — https://SQLsharp.com/ ( SQLCLR library of over 340 Functions and Procedures)
    Sql Quantum Lift — https://SqlQuantumLift.com/ ( company )
    Sql Quantum Leap — https://SqlQuantumLeap.com/ ( blog )
    Info sites — Collations     •     Module Signing     •     SQLCLR


    • Marked as answer by mg30 Sunday, June 2, 2019 4:43 PM
    • Edited by Solomon Rutzky Monday, June 3, 2019 2:49 PM
    Friday, May 31, 2019 7:07 AM
  • Do not use SQL_Latin1_General_CP437_CI_AI; CP437 is the code page for the command-line window in some installations of Windows. You will get problems with varchar data that includes words like résumé, naïve, rôle.

    Language-neutral? There is no such thing. I would go for Latin1_General_CI_AI. (I also recommend to stay away from SQL collations in general.) This collation is good for English, German, Dutch, Italian and I would also suspect Russian, if you are using nvarchar data.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
    Tuesday, September 16, 2014 10:03 PM

All replies

  • Can you ask the originator what is "language neutral"?  Each collation has a language attached to it.

    select name, COLLATIONPROPERTY(name, 'CodePage') as Code_Page, description
    from sys.fn_HelpCollations() ORDER BY Name;
    /*
    Albanian_100_BIN	1250	Albanian-100, binary sort
    Albanian_100_BIN2	1250	Albanian-100, binary code point comparison sort
    Albanian_100_CI_AI	1250	Albanian-100, case-insensitive, accent-insensitive, kanatype-insensitive, width-insensitive
    Albanian_100_CI_AI_KS	1250	Albanian-100, case-insensitive, accent-insensitive, kanatype-sensitive, width-insensitive
    Albanian_100_CI_AI_KS_SC	1250	Albanian-100, case-insensitive, accent-insensitive, kanatype-sensitive, width-insensitive, supplementary characters
    Albanian_100_CI_AI_KS_WS	1250	Albanian-100, case-insensitive, accent-insensitive, kanatype-sensitive, width-sensitive
    Albanian_100_CI_AI_KS_WS_SC	1250	Albanian-100, case-insensitive, accent-insensitive, kanatype-sensitive, width-sensitive, supplementary characters
    Albanian_100_CI_AI_SC	1250	Albanian-100, case-insensitive, accent-insensitive, kanatype-insensitive, width-insensitive, supplementary characters
    Albanian_100_CI_AI_WS	1250	Albanian-100, case-insensitive, accent-insensitive, kanatype-insensitive, width-sensitive
    ...*/



    Kalman Toth Database & OLAP Architect SQL Server 2014 Database Design
    New Book / Kindle: Exam 70-461 Bootcamp: Querying Microsoft SQL Server 2012


    Tuesday, September 16, 2014 7:30 PM
  • Do not use SQL_Latin1_General_CP437_CI_AI; CP437 is the code page for the command-line window in some installations of Windows. You will get problems with varchar data that includes words like résumé, naïve, rôle.

    Language-neutral? There is no such thing. I would go for Latin1_General_CI_AI. (I also recommend to stay away from SQL collations in general.) This collation is good for English, German, Dutch, Italian and I would also suspect Russian, if you are using nvarchar data.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
    Tuesday, September 16, 2014 10:03 PM
  • Ereland, Thanks. You have confirmed it. I have never heard of language-neutral when referring to collations either. We will stay with SQL_Latin1_General_CP1_CI_AI

    Thanks.

    Wednesday, September 17, 2014 9:35 AM
  • Hi there. I wanted to clarify a few things that have been said here:

    We normally use SQL_Latin1_General_CP1_CI_AS so that would not totally work.

    Actually, in this specific context (i.e. this particular application) it would have worked. You have been told that it won't, but yes, it would have. I will explain more in a moment.

    The proprietary application indicates SQL_Latin1_General_CP437_CI_AI as an example.

    Actually, no, it is not an example; it is one of two "supported" collations. The "proprietary application" is Atlassian Jira. According to their "Connecting Jira applications to SQL Server 2012" guide:

    "We support SQL_Latin1_General_CP437_CI_AI and Latin1_General_CI_AI as case-insensitive, accent-insensitive, and language neutral collation types."

    You can ignore the word "types". These two collations are listed in other places as the only two supported collations.

    What is puzzling me is "language neutral" - never had to deal with that one. So, is it safe to say that the common collation that many use SQL_Latin1_General_CP1_CI_AS is NOT language neurtral?

    Correct, that collation, and most others, are not "language neutral". There are three aspects of collations that are affected by the culture (i.e. Latin1_General vs French vs Hebrew vs Syriac, and so on):

    1. The code page used for the VARCHAR / CHAR / TEXT character set / encoding
    2. The comparison and sorting rules
    3. uppercase / lowercase mappings (though this only impacts 2 characters — dotted and dotless "i"/"I" — and for only 2 cultures — Azeri and Turkish)

    Language Neutral Collations?

    There are no collations that are fully language neutral like the "invariant" culture in .NET. However, there are some collations that are effectively language neutral: binary collations. All binary collations are language neutral in terms of comparison and sorting rules. They have no linguistic rules at all. Their code page still controls the character set for VARCHAR data, but for NVARCHAR data that is a non-issue. So technically speaking, binary collations, when used with NVARCHAR data, are language neutral (not 100% due to the uppercase/lowercase mapping issue for the Turkish/Azeri dotted and dotless "i"s, but as far as I know that is the only exception, so 99.99% is good enough, right? ;-).

    For more details on these distinctions, please see: 

    Differences Between the Various Binary Collations (Cultures, Versions, and BIN vs BIN2)

    BUT, just because there are collations that are effectively language neutral does not mean that any of them will work for this purpose. They won't. They are very much not case insensitive, and case insensitivity is really the only true requirement. The phrase "language neutral", at least in terms of it being used in the Jira documentation, is nonsensical. There is no reason for that phrase to be there outside of simply not understanding how collations (at least in SQL Server) work.

    Code Page 437?

    • Code Page 437 chart
    • Code Page 1252 chart
    • 54 characters shared by both code pages:

      ƒ   ¡ ¢ £ ¥ ª « ¬ ° ± ² µ · º » ¼ ½ ¿ Ä Å Æ Ç É Ñ Ö Ü ß à á â ä å æ ç è é ê ë ì í î ï ñ ò ó ô ö ÷ ù ú û ü ÿ

    • 74 characters unique to code page 437:

      ₧ ⌐ ░ ▒ ▓ │ ┤ ╡ ╢ ╖ ╕ ╣ ║ ╗ ╝ ╜ ╛ ┐ └ ┴ ┬ ├ ─ ┼ ╞ ╟ ╚ ╔ ╩ ╦ ╠ ═ ╬ ╧ ╨ ╤ ╥ ╙ ╘ ╒ ╓ ╫ ╪ ┘ ┌ █ ▄ ▌ ▐ ▀ α Γ π Σ σ τ Φ Θ Ω δ ∞ φ ε ∩ ≡ ≥ ≤ ⌠ ⌡ ≈ ∙ √ ⁿ ■

    • 74 characters unique to code page 1252 (the 5 boxes are undefined characters):

      €  ‚ „ … † ‡ ˆ ‰ Š ‹ Œ  Ž   ‘ ’ “ ” • – — ˜ ™ š › œ  ž Ÿ ¤ ¦ § ¨ © ­ ® ¯ ³ ´ ¶ ¸ ¹ ¾ À Á Â Ã È Ê Ë Ì Í Î Ï Ð Ò Ó Ô Õ × Ø Ù Ú Û Ý Þ ã ð õ ø ý þ

    So, Code Page 437, while not ideal, could work, and it seems to for many people who have, for better or for worse, followed Atlassian's incorrect guidance and used that collation. And as you can see, words like résumé, naïve, and rôle certainly will work, though many others clearly will not.

    Which Collation to Use?

    I recommend using Latin1_General_100_CI_AI. I recommend this because:

    1. The vast majority of all string columns in tables created by Atlassian use NCHAR, NVARCHAR, and NTEXT (yes, this has been deprecated since the release of SQL Server 2005, and is why you can't use a collation ending with "_SC"). Meaning: the datatypes are Unicode and don't even use code pages in the first place.
    2. If the code page did matter, then there are several other cultures that use code page 1252 (the code page used by one of there supported collations: Latin1_General_CI_AI), such as the French_ collations. So wouldn't French_CI_AI also be supported?
    3. They can't be relying on any of the characters that are unique to either code page 437 or code page 1252 as the app wouldn't work since you can't use both collations.
    4. Since they are really only using the Unicode-only types (i.e. the "N"-prefixed types), then the sorting and comparison rules are not only the same between the two officially supported collations, but are also identical to the following collations: SQL_AltDiction_CP850_CI_AI, SQL_Latin1_General_CP1_CI_AI, SQL_Latin1_General_CP1253_CI_AI, and SQL_Latin1_General_CP850_CI_AI. So why aren't those 4 supported?
    5. If Latin1_General_CI_AI is supported, then there's almost no chance that Latin1_General_100_CI_AI (which is a far better choice) isn't also supported. The version 100 collation has about 15,400 more sort weight definitions, plus 438 more uppercase/lowercase mappings. Not having those sort weights means that 15,400 more characters in the non-100 version equate to space, an empty string, and to each other. Not having those case mappings means that 438 more characters in the non-100 version return the character passed in (i.e. no change) for the UPPER() and LOWER() functions. There is no reason at all to want Latin1_General_CI_AI instead of Latin1_General_100_CI_AI. There might be a need if code was put into place to work around these deficiencies, and that code would behave incorrectly under the newer, better version of that collation. However, it's highly unlikely that code was put into place to account for this, and extremely unlikely that if such code did exist, that it would error or doing things incorrectly due to the newer collation.
    6. There is a) no evidence to suggest that other collation wouldn't work just fine. Perhaps not binary collations (and maybe also not case-sensitive?), but any case-insensitive (and optionally accent-insensitive) collation should be just fine. b) there is evidence of several cases of installations working for years under other collations (i.e. case-insensitive collations that are neither of the two supported collations).
    7. Between Latin1_General_CI_AI and Latin1_General_100_CI_AI, the character sets are the same for both VARCHAR (code page 1252) and NVARCHAR (Unicode), so no potential data loss. They are both case and accent INsensitive, so the sorting and comparison behavior is generally the same. I emphasized "generally" because the version 100 variant is simply better / more accurate in terms of sorting and comparison; it can even differentiate between different supplementary characters whereas the "supported" variant will see ALL supplementary characters (emojis and many others) as ALL being the same (in fact, the same nothing as those extra 15,000+ characters in the non-version 100 variant, or extra 115,000+ characters if you throw in the first 100k supplementary characters). Given that the code is written in US English, it won't notice these differences. But, queries against the data you enter (including config data for project names, issue types, story names, descriptions, comments, etc) will notice the difference if you ever use any of the characters with missing sort weights and/or uppercase/lowercase mappings. If you are searching for something, wouldn't you want as many characters as possible to actually match that same character in the data? Characters such as — ǹ ȧ ȼ ѐ ӧ — might not get used much, but if they ever do, you won't be able to filter on them unless you use Latin1_General_100_CI_AI (or even Latin1_General_100_CI_AS).
    8. All information thus far points to this being an issue of Atlassian simply not knowing anything about SQL Server collations (the two supported collations are one indicator, and the whole "language neutral" thing is another). And to be fair, they aren't a Microsoft shop. They produce Java apps that work against several RDBMSs. They're doing ok with SQL Server, but clearly need to fix the NTEXT columns to be NVARCHAR(MAX), and need to update their collation requirements (and some of the code related to temp tables). I suspect that they claim to support only these two collations due to running into problems years ago, not knowing what was happening or how to fix it, but saw that they didn't have problems with these two collations, and so decided to make things easy by just simply requiring one of these two.


    All of that being said, if you are going to stick with one of the two "supported" collations because it was what the vendor stated, even if they have no idea why they stated it, and even if they were unaware of a better option (i.e. what I recommended: Latin1_General_100_CI_AI), then please pick Latin1_General_CI_AI instead of the SQL Server collation SQL_Latin1_General_CP437_CI_AI.

    Anything Else?

    Yes, be sure to use the same default collation at the instance-level and at the database-level. There is at least one bug due to some code joining a table in the Jira DB with a local temporary table on a string column. They did not use COLLATE DATABASE_DEFAULT on the predicate and so get a collation mismatch error:

    Collation error in Jira when the database and server/default collation do not match in SQL Server



    P.S. For more info on working with collations in general (especially within SQL Server), please visit: Collations Info



    Take care,
    Solomon...

    SQL# — https://SQLsharp.com/ ( SQLCLR library of over 340 Functions and Procedures)
    Sql Quantum Lift — https://SqlQuantumLift.com/ ( company )
    Sql Quantum Leap — https://SqlQuantumLeap.com/ ( blog )
    Info sites — Collations     •     Module Signing     •     SQLCLR


    • Marked as answer by mg30 Sunday, June 2, 2019 4:43 PM
    • Edited by Solomon Rutzky Monday, June 3, 2019 2:49 PM
    Friday, May 31, 2019 7:07 AM
  • Thx Solomon

    MG

    Friday, May 31, 2019 11:38 AM