none
group by languages in multilingual table in sql server RRS feed

  • Question

  • Hi ,

    I am having a multilingual table in SQL server 2008 , the table has two columns ID and Text

    The text column has English, Chinese and other language texts.

    I need a resultset grouped by the languages and count of id like

    Language   count_of_id

    English          25

    Chinese         10

    other languages 3

    Is this possible? Can you please help me?

    Wednesday, January 7, 2015 7:15 AM

Answers

  • Good day SqlServer_learn

    I have a saying that I always uses: anything is possible in developing, if you have the appropriate resources (change the existing solution could be part of the way to solve...)

    regarding your question, there is simple solution, but for most cases, I highly recommend to change the table structure and add a column for the culture of the text (like: en-us for english, he-il for Hebrew and so on..).

    Since you are using unicode column like nvarchar to store multi language text, we can get the language from the text itself, as long as it include characters from that language (text which include only numbers for example we we consider as default language since it is the same in all languages). 

    step 1: First you need an accessory table (Named like UnicodeMapping) which include all unicode characters and the number of the char in unicode (a mapping unicode table). You can use ranges as well, but it will be faster for the queries if you actually have all the characters, and not just range.

    For example this table (I added English and Hebrew... Do the same with all the languages that you need):

    create table UnicodeMapping (Charecter nchar(1), UnicodeNum int, CultureN NVARCHAR(100), CollateN NVARCHAR(100))
    GO
    -- fill the table with main Hebrew characters, using a number table
    insert UnicodeMapping (Charecter, UnicodeNum, CultureN, CollateN)
    select NCHAR(n), n, 'He-IL', 'Hebrew_CI_AS'
    from _ArielyAccessoriesDatabase.dbo.ArielyNumbers
    where 
    	n between 1488 and 1514 -- Hebrew 
    	or n between 64304 and 64330 -- Hebrew 
    GO
    -- fill the table with main English characters, using a number table
    insert UnicodeMapping (Charecter, UnicodeNum, CultureN, CollateN)
    select NCHAR(n), n, 'En-US', 'SQL_Latin1_General_CP1_CI_AS'
    from _ArielyAccessoriesDatabase.dbo.ArielyNumbers
    where 
    	n between 97 and 122 -- En
    	or n between 65 and 90 -- En 
    GO
    -- Do the same with all the languages that you need, and all the UNICODE ranges for those languages
    
    select * from UnicodeMapping
    GO

    Step 2: You can create a function which get NVARCHAR as input and return the culture as output, or work directly on the data using JOIN your table and this table.

    Assuming that each row is in specific language, In order to recognize the language, you just need to check 1 character from the original string (a text character and not a number for example which might be in any language) and examine which language this single character is, using the our UnicodeMapping.

    You can check this thread to see an implementation of this idea: https://social.msdn.microsoft.com/Forums/sqlserver/en-US/ccc1d16f-926f-46c8-8579-b2eecf661e7c/sort-miultiple-language-data-in-sql-serevr-by-collation?forum=transactsql

    * dont forget to add to the table all the characters like numbers and chose them as your default language

    * in the link above I just select the first character using LEFT, but if the text start with number for example then you will get default language. If you sure that the text must start with real language character then it is best solution, but if not, than It is better to use a "user defined function" which will find the first character that is not in the default language. if the function do not find any char in non-default language than it return default language, else it check the language using the UnicodeMapping and return it.


    signature   Ronen Ariely
     [Personal Site]    [Blog]    [Facebook]


    Wednesday, January 7, 2015 8:07 AM
    Moderator

All replies

  • Not Possible in SQL Server.

    - Vishal

    SqlAndMe.com

    Wednesday, January 7, 2015 7:27 AM
  • Good day SqlServer_learn

    I have a saying that I always uses: anything is possible in developing, if you have the appropriate resources (change the existing solution could be part of the way to solve...)

    regarding your question, there is simple solution, but for most cases, I highly recommend to change the table structure and add a column for the culture of the text (like: en-us for english, he-il for Hebrew and so on..).

    Since you are using unicode column like nvarchar to store multi language text, we can get the language from the text itself, as long as it include characters from that language (text which include only numbers for example we we consider as default language since it is the same in all languages). 

    step 1: First you need an accessory table (Named like UnicodeMapping) which include all unicode characters and the number of the char in unicode (a mapping unicode table). You can use ranges as well, but it will be faster for the queries if you actually have all the characters, and not just range.

    For example this table (I added English and Hebrew... Do the same with all the languages that you need):

    create table UnicodeMapping (Charecter nchar(1), UnicodeNum int, CultureN NVARCHAR(100), CollateN NVARCHAR(100))
    GO
    -- fill the table with main Hebrew characters, using a number table
    insert UnicodeMapping (Charecter, UnicodeNum, CultureN, CollateN)
    select NCHAR(n), n, 'He-IL', 'Hebrew_CI_AS'
    from _ArielyAccessoriesDatabase.dbo.ArielyNumbers
    where 
    	n between 1488 and 1514 -- Hebrew 
    	or n between 64304 and 64330 -- Hebrew 
    GO
    -- fill the table with main English characters, using a number table
    insert UnicodeMapping (Charecter, UnicodeNum, CultureN, CollateN)
    select NCHAR(n), n, 'En-US', 'SQL_Latin1_General_CP1_CI_AS'
    from _ArielyAccessoriesDatabase.dbo.ArielyNumbers
    where 
    	n between 97 and 122 -- En
    	or n between 65 and 90 -- En 
    GO
    -- Do the same with all the languages that you need, and all the UNICODE ranges for those languages
    
    select * from UnicodeMapping
    GO

    Step 2: You can create a function which get NVARCHAR as input and return the culture as output, or work directly on the data using JOIN your table and this table.

    Assuming that each row is in specific language, In order to recognize the language, you just need to check 1 character from the original string (a text character and not a number for example which might be in any language) and examine which language this single character is, using the our UnicodeMapping.

    You can check this thread to see an implementation of this idea: https://social.msdn.microsoft.com/Forums/sqlserver/en-US/ccc1d16f-926f-46c8-8579-b2eecf661e7c/sort-miultiple-language-data-in-sql-serevr-by-collation?forum=transactsql

    * dont forget to add to the table all the characters like numbers and chose them as your default language

    * in the link above I just select the first character using LEFT, but if the text start with number for example then you will get default language. If you sure that the text must start with real language character then it is best solution, but if not, than It is better to use a "user defined function" which will find the first character that is not in the default language. if the function do not find any char in non-default language than it return default language, else it check the language using the UnicodeMapping and return it.


    signature   Ronen Ariely
     [Personal Site]    [Blog]    [Facebook]


    Wednesday, January 7, 2015 8:07 AM
    Moderator