SQL Server Unicode Characters in a column

  • Question

  • I have an ETL that fails because of some Unicode characters. The characters are being inserted from somewhere in the application, and we are still trying to ascertain the source. Until then, I need to run a T-SQL query to find the rows (I know the column name) that are causing the problem, modify the data to the correct format, and then run the ETL.

    Would I need to use a cursor? Can anyone please advise?

    It is a bit urgent, so any help would be great!

    Thanks

    Nimish

    Tuesday, August 31, 2010 5:48 PM

Answers

  • Don't know whether this will work or not... don't have access to SQL at the moment...

    SELECT TheColumn
    FROM MyTable
    WHERE TheColumn<>CONVERT(nvarchar(100),CONVERT(varchar(100),TheColumn))

    In other words, if you convert the column to VARCHAR and then back to NVARCHAR, is the value the same? If not, then that column has Unicode characters.
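    One caveat (a guess, since the column's declared length isn't given here): if values can be longer than 100 characters, size both conversions to the column's full length, otherwise truncation at 100 characters would flag rows that contain no Unicode at all. For example, with a 510-character column:

    SELECT TheColumn
    FROM MyTable
    WHERE TheColumn<>CONVERT(nvarchar(510),CONVERT(varchar(510),TheColumn))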


    --Brad (My Blog)
    • Proposed as answer by Naomi N Tuesday, August 31, 2010 6:50 PM
    • Marked as answer by Kalman Toth Monday, September 13, 2010 7:44 AM
    Tuesday, August 31, 2010 6:06 PM

All replies

  • Don't know whether this will work or not... don't have access to SQL at the moment...

    SELECT TheColumn
    FROM MyTable
    WHERE TheColumn<>CONVERT(nvarchar(100),CONVERT(varchar(100),TheColumn))

    In other words, if you convert the column to VARCHAR and then back to NVARCHAR, is the value the same? If not, then that column has Unicode characters.


    --Brad (My Blog)
    • Proposed as answer by Naomi N Tuesday, August 31, 2010 6:50 PM
    • Marked as answer by Kalman Toth Monday, September 13, 2010 7:44 AM
    Tuesday, August 31, 2010 6:06 PM
  • What is the type of the column? Also, can you be more specific about why Unicode is causing a problem? Generally when Unicode is inserted into a non-Unicode string column, the characters are converted according to the collation/codepage in effect, and non-convertible values are changed to question marks ('?'). For example:

    declare @unicode nvarchar(10) = N'你好';
    
    declare @t table (
     c varchar(10) collate Latin1_General_CI_AI
    );
    
    insert into @t values (@unicode);
    select c, cast(c as binary(10)) as c_bin from @t;
    

    Presumably your ETL process is designed to fail under certain conditions. If you can explain what that condition is and describe the source column type, it may be easier to make a suggestion. 

    Steve Kass
    http://www.stevekass.com

    Tuesday, August 31, 2010 7:28 PM
    SELECT * FROM tbl WHERE col LIKE '%' + nchar(nnn) + '%'
    COLLATE Latin1_General_BIN2

    Where nnn is the decimal Unicode code for the problematic character.

    If you are on SQL 2008, use Latin1_General_100_BIN2, as the newer collation includes more characters.

    Then again, if you have an nvarchar column, Unicode characters are all you have!
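    If the goal is simply to find anything outside plain ASCII, an untested sketch (this assumes that everything outside the printable ASCII range, space through tilde, counts as unwanted):

    SELECT * FROM tbl
    WHERE col LIKE N'%[^ -~]%' COLLATE Latin1_General_BIN2

    With a binary collation the range [ -~] covers exactly code points 32 to 126, so the negated character class matches any other character (including tabs and line breaks, which you may want to permit).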


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
    Links for SQL Server Books Online:
    SQL 2008: http://msdn.microsoft.com/en-us/sqlserver/cc514207.aspx
    SQL 2005: http://msdn.microsoft.com/en-us/sqlserver/bb895970.aspx
    SQL 2000: http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx
    Tuesday, August 31, 2010 10:10 PM
  • Hi Brad

    Thanks for the response, but the table has a huge number of rows, so running this query and then checking each row to see whether the values match seems like a difficult proposition.

    What we need to do is find the rows that are causing the problem and then send them to the business owners, who would fix the data, i.e. remove the Unicode characters.

    Thanks

    Nimish

    Wednesday, September 1, 2010 12:39 AM
  • Hi Steve

    Thanks for the response.

    In the source database:

    a. Column Datatype : nvarchar ; length 510

    b. Collation : Latin1_General_CI_AS

    in the destination DB:

    a. Column Datatype : varchar ; length 125

    b. Collation : Latin1_General_CI_AS

    In the ETL I am failing rows on truncation and error. What we need to do is find the rows that are causing the problem and send them to the business owners, who would fix the data, i.e. remove the Unicode characters. We cannot have invalid characters in this column because we use it for reporting, and any invalid data would show up in the reports.

    Thanks

    Nimish

    Wednesday, September 1, 2010 12:46 AM
  • Hi Erland

    We get the table from the business and thus have no clue as to what the Unicode characters would be.

    Can you suggest a decimal Unicode range that would let me check all the characters in the column? And how can I check for that condition with your code above?

    Some information :

    In the source database:

    a. Column Datatype : nvarchar ; length 510

    b. Collation : Latin1_General_CI_AS

    in the destination DB:

    a. Column Datatype : varchar ; length 125

    b. Collation : Latin1_General_CI_AS

    Thanks

    Nimish


    Wednesday, September 1, 2010 12:50 AM
  • Hello Nimish,

    In your ETL (SSIS) data flow, use a Derived Column to convert your input column to the required (destination) data type. Then, in Configure Error Output, redirect rows on truncation and error. (Any rows that cannot be converted to the correct data type, i.e. that contain Unicode, will be redirected.) In most ETL designs, these records go into a suspend table along with a reason code, and a report is sent to the business users at the end of the process.

    While redirecting these rows, use another Derived Column to stamp a reason code (e.g. SuspendReasonCode), insert them into the suspend table, and produce the report at the end of the SSIS package.

     

    Thanks
    Wednesday, September 1, 2010 1:00 AM
  • Hi Aamir

    Thanks for your response. The business owners want to run the ETL after they correct the data, i.e. after they remove the Unicode characters. I was looking for a T-SQL statement that would help me find the invalid rows (those containing the Unicode characters) so they can be changed before we run the ETL.

    Thanks

    Nimish

    Wednesday, September 1, 2010 1:08 AM
    Declare @t table (uni nvarchar(510))

    insert into @t values (N'TEST')
    insert into @t values (N'Ů')
    insert into @t values (N'Ŧ')
    insert into @t values (N'1234')
    insert into @t values (N'123TEST')
    insert into @t values (N'TESTŴTEST')

    Select * from @t

    Select * from @t
    where uni <> CONVERT(varchar(125), uni)

    Output -- rows with Unicode in the data:

    uni
    Ů
    Ŧ
    TESTŴTEST
    

    Still, you have to think about the length of your data: the column is nvarchar(510) and you are converting to varchar(125), so some of the input may be longer than varchar(125) can hold.
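    To catch both problems in one pass, a sketch against the same table variable (this assumes the destination stays varchar(125)):

    Select * from @t
    where uni <> CONVERT(varchar(125), uni) -- non-convertible characters
       or LEN(uni) > 125                    -- too long for the destination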

     

    Edited: Nimish, you mentioned above that you have a lot of rows. But you are using only one column in the query, and you are not returning a huge result set; you will get back only the rows that have Unicode in them, so it should not take that long. Give it a try, see how long it takes, and let us know your results.

    Thanks

    Wednesday, September 1, 2010 1:18 AM
  • We get the table from the business and thus have no clue as to what the Unicode characters would be.

    Can you suggest a decimal Unicode range that would let me check all the characters in the column? And how can I check for that condition with your code above?


    No, I can't. I have absolutely no idea which characters you think are invalid and which are not. You talk about "Unicode characters", but as I said, all characters are Unicode characters.

    A query like this:

    SELECT ch COLLATE Latin1_General_BIN2, COUNT(*)
    FROM   tbl
    CROSS  APPLY (SELECT ch = substring(col, Number, 1)
                  FROM   Numbers
                  WHERE  Number BETWEEN 1 AND len(col)) AS x
    GROUP  BY ch COLLATE Latin1_General_BIN2
    ORDER  BY 2

    This will give you a tally of the characters in the column. Presumably the unwanted character(s) occur at low frequency.

    You find the Numbers table here: http://www.sommarskog.se/arrays-in-sql-2005.html#tblnum
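    If you don't have a Numbers table yet, a quick substitute (a sketch; any source with enough rows will do, and 1000 numbers more than covers a 510-character column):

    SELECT TOP 1000 Number = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    INTO   Numbers
    FROM   sys.all_objects a CROSS JOIN sys.all_objects b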


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
    Links for SQL Server Books Online:
    SQL 2008: http://msdn.microsoft.com/en-us/sqlserver/cc514207.aspx
    SQL 2005: http://msdn.microsoft.com/en-us/sqlserver/bb895970.aspx
    SQL 2000: http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx
    Wednesday, September 1, 2010 7:48 AM