none
Sort Miultiple language data in sql serevr by collation RRS feed

  • Question

  • Envirment: Sql Server 2008 R2 

    I have a table Column which have Multiple language data
    I need to apply Sorting according to my specified language like Hindi.
    How to achive it though collation or something else<br>

    **Column Name**: "Comments"<br>
    **Type** : Nvarchar(MAx)<br>
    **Column data**
    -------------------
    This is an example<br>
    To je příklad.<br>
    هذا مثال على ذلك.<br>
    उदाहरण है.<br>
    यह एक उदाहरण है.<br>

    ಈ ಒಂದು ಉದಾಹರಣೆಯಾಗಿದೆ.<br>

    **Output after 'Hindi' sorting should be** :<br>

    उदाहरण है.<br>
    यह एक उदाहरण है.<br>
    This is an example<br>
    To je příklad.<br>
    هذا مثال على ذلك.<br>
    ಈ ಒಂದು ಉದಾಹರಣೆಯಾಗಿದೆ.<br>

    I have tried following Script <br>

    CREATE TABLE dbo.Data
    ( unicodeData NVARCHAR(200)
    )
    GO
    INSERT INTO dbo.Data ( unicodeData)
    VALUES
     ( N'This is an example')
    , (N'यह एक उदाहरण है.')
    , (N'उदाहरण है.')
    , (N'ಈ ಒಂದು ಉದಾಹರಣೆಯಾಗಿದೆ.')
    , (N'एक उदाहरण है.')
    , (N'هذا مثال على ذلك.')
    , (N'To je příklad.');
    GO
    
    select * from dbo.unicodeData
    order by unicodedata Collate Indic_General_100_CI_AI

    Please Give your suggested solution it is very urgent for me

    Thanks.


    Shashank


    Friday, December 5, 2014 11:36 AM

Answers

  • This is a "All-In-One" solution you can use based on my explanation above:

    select unicodeData, CultureN, 
    	RN = 
    	CASE    
    		when CultureN = 'He-IL' then RN_He
    		when CultureN = 'En-US' then RN_En
    		else RN_Else
    	END
    from (	select 
    		unicodeData, 
    		UM.CultureN, 
    		ROW_NUMBER() OVER (order by unicodedata Collate Hebrew_CI_AS) RN_He,
    		ROW_NUMBER() OVER (order by unicodedata Collate Indic_General_100_CI_AI) RN_Else,
    		ROW_NUMBER() OVER (order by unicodedata Collate SQL_Latin1_General_CP1_CI_AS) RN_En
    	from Data D
    	left JOIN UnicodeMapping UM on UNICODE(LEFT(D.unicodeData,1)) = UM.UnicodeNum
    )t
    order by CultureN,RN

    ** Again, if you don't know the languages that you have in the table, then you can use this solution with dynamic query

    ** just want to say that I am writing article regarding this issue...
    check TechNet WIKI for more information :-)


    signature   Ronen Ariely
     [Personal Site]    [Blog]    [Facebook]

    Friday, December 5, 2014 2:50 PM
    Moderator

All replies

  • You would need a additional order by condition 

    drop table unicodeData
    CREATE TABLE dbo.unicodeData
    ( languageUsed VARCHAR(50)
    , unicodeData NVARCHAR(max)
    , nonUnicodeData VARCHAR(200) -- same data in a normal VARCHAR column for comparison
    , comments VARCHAR(100)
    )
    GO
    INSERT INTO dbo.unicodeData (languageUsed, unicodeData, nonUnicodeData, comments)
    VALUES
     ('English', N'This is an example', N'This is an example', NULL)
    , ('Hindi', N'यह एक उदाहरण है.', N'यह एक उदाहरण है.', 'Using the preceding N in both strings but VARCHAR is still a ?')
    , ('Hindi', 'यह एक उदाहरण है.', 'यह एक उदाहरण है.', 'Not using the preceding N in both strings so both are a ?')
    , ('Kannada', N'ಈ ಒಂದು ಉದಾಹರಣೆಯಾಗಿದೆ.', N'ಈ ಒಂದು ಉದಾಹರಣೆಯಾಗಿದೆ.', NULL)
    , ('Arabic', N'هذا مثال على ذلك.', N'هذا مثال على ذلك.', NULL)
    , ('Czech', N'To je příklad.', N'To je příklad.', NULL);
    GO
    
    select * from dbo.unicodeData
    order by case when languageUsed='Hindi' then 1 else 99 end,unicodedata Collate Indic_General_100_CI_AI 


    Satheesh
    My Blog | How to ask questions in technical forum


    Friday, December 5, 2014 11:43 AM
  • Thanks Satheesh for quick reply but i can not use languageUserd Column as in my orignal db i have only UnicodeData Column.

    I am editing my script for less confusion ,

    Can you describe that how collation work in order by for hindi in detail.


    Shashank

    Friday, December 5, 2014 12:09 PM
  • If I understand your post, you want text in Hindi to be sorted before any other script. The way to achieve this would be do like Satheesh suggested, but rather than "languageUsed='Hindi'", you would have

      CASE dbo.is_hindi(unicodedata) WHEN 1 ELSE 99 END

    The is_hindi would be a user-defined function, probably inmplemented in the CLR, that deduces whether the string is Hindi or some other language. Definitely a non-trivial problem. And, no, I am not going to write it, given my poor command of Hindi.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
    Friday, December 5, 2014 1:35 PM
  • Using one query with one "order by...Collate Indic_General_100_CI_AI" will lead to bad order for all other languages.

    signature   Ronen Ariely
     [Personal Site]    [Blog]    [Facebook]

    Friday, December 5, 2014 2:13 PM
    Moderator
  • Hi

    If I understand correctly your data include several languages and you want to select it all together.
    sorting by specific culture will lead to a bad sorting, which fit only those specific culture's rows.
    you should use different approach, in this case :-)

    ** I will use UNION to make it more clear, but you can do it in one select and several ROW_NUMBER in the same select.

    First you need an accessory table (Named like UnicodeMapping) which include all unicode characters and the number of the char in unicode (a mapping unicode table). You can use ranges as well, but it will be faster for the queries if you actually have all the characters, and not just range.

    For example this table (I added English and Hebrew... Do the same with all the languages that you need):

    create table UnicodeMapping (Charecter nchar(1), UnicodeNum int, CultureN NVARCHAR(100), CollateN NVARCHAR(100))
    GO
    -- fill the table with main Hebrew characters, using a number table
    insert UnicodeMapping (Charecter, UnicodeNum, CultureN, CollateN)
    select NCHAR(n), n, 'He-IL', 'Hebrew_CI_AS'
    from _ArielyAccessoriesDatabase.dbo.ArielyNumbers
    where 
    	n between 1488 and 1514 -- Hebrew 
    	or n between 64304 and 64330 -- Hebrew 
    GO
    -- fill the table with main English characters, using a number table insert UnicodeMapping (Charecter, UnicodeNum, CultureN, CollateN) select NCHAR(n), n, 'En-US', 'SQL_Latin1_General_CP1_CI_AS' from _ArielyAccessoriesDatabase.dbo.ArielyNumbers where n between 97 and 122 -- En or n between 65 and 90 -- En GO -- Do the same with all the languages that you need, and all the UNICODE ranges for those languages select * from UnicodeMapping GO

    Let's create your table and insert the data (I added a row in Hebrew for the demo, since I did not have the other languages UNICODE mapping.. you can get it from Google)

    CREATE TABLE dbo.Data (unicodeData NVARCHAR(200))
    GO
    
    INSERT INTO dbo.Data (unicodeData)
    VALUES
      (N'This is an example')
    	  , (N'यह एक उदाहरण है.')
    	  , (N'उदाहरण है.')
    	  , (N'ಈ ಒಂದು ಉದಾಹರಣೆಯಾಗಿದೆ.')
    	  , (N'एक उदाहरण है.')
      , (N'هذا مثال على ذلك.')
      , (N'To je příklad.')
      , (N'זה רק דוגמה')
    GO

    Now in the query you can use UNION, and in each part, select only the rows from a specific language.

    * In order to know what language this row use, you should use the function UNICODE and compare the result to your accessory table (UnicodeMapping)

    ** For most cases, it is much better to store the language in the Data table, and not using this complex mapping each time!

    Now in each UNION part you should use the right culture for the select. and you can use something like this:


    select unicodeData from (
    	select unicodeData, UM.CultureN, ROW_NUMBER() OVER (order by unicodedata Collate Hebrew_CI_AS) RN
    	from Data D
    	left JOIN UnicodeMapping UM on UNICODE(LEFT(D.unicodeData,1)) = UM.UnicodeNum
    	where UM.CultureN = 'He-IL'
    	UNION ALL
    
    	select unicodeData, UM.CultureN, ROW_NUMBER() OVER (order by unicodedata Collate SQL_Latin1_General_CP1_CI_AS) RN
    	from Data D
    	left JOIN UnicodeMapping UM on UNICODE(LEFT(D.unicodeData,1)) = UM.UnicodeNum
    	where UM.CultureN = 'En-US'
    	UNION ALL
    
    	select unicodeData, UM.CultureN, ROW_NUMBER() OVER (order by unicodedata Collate Indic_General_100_CI_AI) RN
    	from Data D
    	left JOIN UnicodeMapping UM on UNICODE(LEFT(D.unicodeData,1)) = UM.UnicodeNum
    	where UM.CultureN is null -- fix this for Hindi in your case
    ) t
    order by CultureN, RN

    * if you dont know in advance what languages you have, then you can use dynamic query.

    ** In this solution I find the culture dynamically from the data itself. If you had a language column, then you dont need most of this solution.

    *** If you want to sort one language before the other (as Erland mentioned then you can use his approach)

    I hope this useful :-)


    signature   Ronen Ariely
     [Personal Site]    [Blog]    [Facebook]



    Friday, December 5, 2014 2:16 PM
    Moderator
  • This is a "All-In-One" solution you can use based on my explanation above:

    select unicodeData, CultureN, 
    	RN = 
    	CASE    
    		when CultureN = 'He-IL' then RN_He
    		when CultureN = 'En-US' then RN_En
    		else RN_Else
    	END
    from (	select 
    		unicodeData, 
    		UM.CultureN, 
    		ROW_NUMBER() OVER (order by unicodedata Collate Hebrew_CI_AS) RN_He,
    		ROW_NUMBER() OVER (order by unicodedata Collate Indic_General_100_CI_AI) RN_Else,
    		ROW_NUMBER() OVER (order by unicodedata Collate SQL_Latin1_General_CP1_CI_AS) RN_En
    	from Data D
    	left JOIN UnicodeMapping UM on UNICODE(LEFT(D.unicodeData,1)) = UM.UnicodeNum
    )t
    order by CultureN,RN

    ** Again, if you don't know the languages that you have in the table, then you can use this solution with dynamic query

    ** just want to say that I am writing article regarding this issue...
    check TechNet WIKI for more information :-)


    signature   Ronen Ariely
     [Personal Site]    [Blog]    [Facebook]

    Friday, December 5, 2014 2:50 PM
    Moderator