locked
Encoding Issue RRS feed

  • Question

  • Hi all,

    I have scenarion where i have a column which is of Nvarchar(Max) ehich contains data with normal characters and also unicode characters. I need to query such that i shud fetch the records which has data other than normal charcters. 

    For Example: I need to fetch those records which has characters like ÖZBFWipfÖAPG'pAPGúsópe


    Please have look on the comment


    • Edited by Santhosh H Monday, May 20, 2013 11:17 AM
    Monday, May 20, 2013 11:15 AM

Answers

  • What is a normal character is the eye of the beholder. For me Ö is a very normal character! W on the other hand is a kind of strange thing they use in foriegn words.

    If we assume that are you are looking for characters outside the ASCII character set, you can write the query as:

    SELECT col FROM tbl
    WHERE col COLLATE Latin1_General_BIN2 LIKE '%[^' + char(32) + '-' + char(126) + ']'

    The key is that you should use a binary collation so that you could work with charcater codes. But depending what you think is a normal character, you may have to modify the pattern.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
    • Marked as answer by Santhosh H Tuesday, May 21, 2013 6:28 AM
    Monday, May 20, 2013 11:54 AM
  • Here is Erland's method, it picks up all the extended ASCII range (128-255) text. I added two normal ascii words. My method only works with true UNICODE (two bytes) letters.

    DECLARE @test TABLE (textstring nvarchar(max), ID INT IDENTITY(1,1) PRIMARY KEY);
    INSERT @test VALUES
     (N'Wolford AG Ä')
    ,(N'Landesrechnungshof OÖ')
    ,(N'Landesrechnungshof O')
    ,(N'VPTÖ')
    ,(N'ARBÖ')
    ,(N'ARBO')
    ,(N'Gebietskrankenkasse NÖ')
    ,(N'Immorent O.Ö')
    ,(N'Kammer Gew W NÖ')
    ,(N'EVN Energieversgg.N.Ö')
    ,(N'RÖTHLIN Wirtschaftsprü')
    ,(N'FAB Reno OÖ')
    ,(N'Geb Krankenk OÖ')
    ,(N'Geb Krankenk NÖ')
    ,(N'Hubertusbräu Johann Kü')
    ,(N'Softwaremanufaktur Grü')
    ,(N'GESPAG OÖ')
    ,(N'Fachhochschulstudiengä')
    ,(N'Fachhochschulstudiengä')
    ,(N'AS Handel EDV & Bauträ')
    ,(N'Lds-Schulrat NÖ')
    ,(N'IQSOFT Gesellschaft fü')
    ,(N'Superfund OÖ')
    ,(N'DPT. 781 0A  SO.Ü')
    ,(N'WIFI OÖ')
    ,(N'WIFI NÖ')
    ,(N'Wifi-OÖ')
    ,(N'INACTIVE/BOGAD HANS Jü')
    ,(N'Media Data OÖ')
    ,(N'Lexmark-Transit /IBM Ö')
    ,(N'INACTIVE/LEXMARK/IBM Ö')
    ,(N'Versicherungsverband Ö')
    ,(N'INACTIVE/LANZ DKFM. Gü')
    ,(N'Landesschulrat für NÖ')
    ,(N'Dvo OÖ')
    ,(N'INACTIVE/MüLLER MAG Gü')
    ,(N'Wirtschaftskammer NÖ')
    ,(N' Ä rztekammer NÖ')
    ,(N' Ä rztek. OÖ')
    ,(N'Brandverhütung OÖ')
    ,(N'Allianz Elementar NÖ')
    ,(N'Landesregierung NÖ')
    ,(N'Sparkasse NÖ')
    ,(N'Ka Arb Ang f NÖ')
    
    SELECT textstring FROM @test 
    WHERE NOT( textstring COLLATE Latin1_General_BIN2 LIKE '%[^' + char(32) + '-' + char(126) + ']');
    /*
    textstring
    Landesrechnungshof O
    ARBO
    */


    Kalman Toth Database & OLAP Architect sqlusa.com
    New Book / Kindle: Pass SQL Exam 70-461 & Job Interview: Programming SQL Server 2012


    • Edited by Kalman Toth Monday, May 20, 2013 12:51 PM
    • Marked as answer by Santhosh H Tuesday, May 21, 2013 6:28 AM
    Monday, May 20, 2013 12:50 PM

All replies

  • What is a normal character is the eye of the beholder. For me Ö is a very normal character! W on the other hand is a kind of strange thing they use in foriegn words.

    If we assume that are you are looking for characters outside the ASCII character set, you can write the query as:

    SELECT col FROM tbl
    WHERE col COLLATE Latin1_General_BIN2 LIKE '%[^' + char(32) + '-' + char(126) + ']'

    The key is that you should use a binary collation so that you could work with charcater codes. But depending what you think is a normal character, you may have to modify the pattern.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
    • Marked as answer by Santhosh H Tuesday, May 21, 2013 6:28 AM
    Monday, May 20, 2013 11:54 AM
  • Try:

    WHERE textstring != convert(varchar(max), textstring)


    Kalman Toth Database & OLAP Architect sqlusa.com
    New Book / Kindle: Pass SQL Exam 70-461 & Job Interview: Programming SQL Server 2012

    Monday, May 20, 2013 11:56 AM
  • Hey Ernald,

    Yes i want to check those recods which doesn't have normal ascii characters like A-Z(both upper and lower) and 0-9.

    @Kalam: the query did not wrk.

    Thanks

    Santhosh


    Please have look on the comment

    Monday, May 20, 2013 12:04 PM
  • "WHERE textstring != convert(varchar(max), textstring)" works if you want to catch characters outside your code page, but you need to force a binary collation, or else a character like ń will fly under the radar in an accent-sensitive collation. And if the code page is 1252, Ö will pass with flying colours.



    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Monday, May 20, 2013 12:07 PM
  • Thank u Erland,But the query that u have suggested does  it give me all records which doesnt have normal characters.Or do i need to chnage anything in the query.


    Please have look on the comment


    • Edited by Santhosh H Monday, May 20, 2013 12:12 PM
    Monday, May 20, 2013 12:12 PM
  • Hi Santosh,

    Can you post a script so we can test?  The characters you are posting actually are in the codepage for SQL Server USA default collation, they are not Unicode characters, they can be represented in one byte, called extended ASCII characters. I added the Hungarian Unicode letter ő for the test.

    My method should work with high likelihood (it depends on your unicode text). Otherwise follow Erland's method:

    DECLARE @textstring nvarchar(max) = N'őÖZBFWipfÖAPG''pfÖAPGúsópe'

    SELECT @textstring, convert(varchar(max), @textstring)

    -- őÖZBFWipfÖAPG'pfÖAPGúsópe  oÖZBFWipfÖAPG'pfÖAPGúsópe

    SELECT 'Unicode' WHERE @textstring != convert(varchar(max), @textstring);

    GO  -- Unicode

    DECLARE @textstring varchar(max) = N'New York City'

    SELECT 'nonUnicode' WHERE @textstring = convert(varchar(max), @textstring);

    GO  -- nonUnicode




    Kalman Toth Database & OLAP Architect sqlusa.com
    New Book / Kindle: Pass SQL Exam 70-461 & Job Interview: Programming SQL Server 2012





    Monday, May 20, 2013 12:13 PM
  • Hey Kalam,

    The characters like " Ö,ú,ó " are in  between the actual data and I want to take only those records which contains characters other than ASCII values and the column that I am trying to query is on NVARCHAR(MAx) data type.


    Please have look on the comment

    Monday, May 20, 2013 12:22 PM
  • I suggest the query, and as long as you don't supply a sufficient onus of sample data, the onus of testing is with you!

    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Monday, May 20, 2013 12:24 PM
  • How about Erland's method? That should work.

    SELECT ASCII('O')  -- 79          ASCII range
    
    SELECT ASCII('Ö')  -- 214         Extended ASCII range
    
    SELECT UNICODE(N'ő')  -- 337      UNICODE 


    Kalman Toth Database & OLAP Architect sqlusa.com
    New Book / Kindle: Pass SQL Exam 70-461 & Job Interview: Programming SQL Server 2012


    Monday, May 20, 2013 12:25 PM
  • Hey Erland ,

    Since I have around 7 lakhs of records its very difficult to search for data with Non ascii values and my req is also to take out those values using a query. I tested with tour query and i could get some 48 records out of 7laks and data that i could get is as below.

    Wolford AG Ä
    Landesrechnungshof OÖ
    VPTÖ
    ARBÖ
    Gebietskrankenkasse NÖ
    Immorent O.Ö
    Kammer Gew W NÖ
    EVN Energieversgg.N.Ö
    RÖTHLIN Wirtschaftsprü
    FAB Reno OÖ
    Geb Krankenk OÖ
    Geb Krankenk NÖ
    Hubertusbräu Johann Kü
    Softwaremanufaktur Grü
    GESPAG OÖ
    Fachhochschulstudiengä
    Fachhochschulstudiengä
    AS Handel EDV & Bauträ
    Lds-Schulrat NÖ
    IQSOFT Gesellschaft fü
    Superfund OÖ
    DPT. 781 0A  SO.Ü
    WIFI OÖ
    WIFI NÖ
    Wifi-OÖ
    INACTIVE/BOGAD HANS Jü
    Media Data OÖ
    Lexmark-Transit /IBM Ö
    INACTIVE/LEXMARK/IBM Ö
    Versicherungsverband Ö
    INACTIVE/LANZ DKFM. Gü
    Landesschulrat für NÖ
    Dvo OÖ
    INACTIVE/MüLLER MAG Gü
    Wirtschaftskammer NÖ
     Ä rztekammer NÖ
     Ä rztek. OÖ
    Brandverhütung OÖ
    Allianz Elementar NÖ
    Landesregierung NÖ
    Sparkasse NÖ
    Ka Arb Ang f NÖ

    But i am sure there are more than 2k records which has non asccii values.

    @@Kalam: yes, Erlands query worked but I am getting only 48 records out of 7lakhs but there are records more than that.


    Please have look on the comment

    Monday, May 20, 2013 12:30 PM
  • You will need to provide sample of rows you think you should be included in the result set that my queries do not return. There may be 700000 rows, but I can't be stand-in for verification of your data.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Monday, May 20, 2013 12:46 PM
  • Here is Erland's method, it picks up all the extended ASCII range (128-255) text. I added two normal ascii words. My method only works with true UNICODE (two bytes) letters.

    DECLARE @test TABLE (textstring nvarchar(max), ID INT IDENTITY(1,1) PRIMARY KEY);
    INSERT @test VALUES
     (N'Wolford AG Ä')
    ,(N'Landesrechnungshof OÖ')
    ,(N'Landesrechnungshof O')
    ,(N'VPTÖ')
    ,(N'ARBÖ')
    ,(N'ARBO')
    ,(N'Gebietskrankenkasse NÖ')
    ,(N'Immorent O.Ö')
    ,(N'Kammer Gew W NÖ')
    ,(N'EVN Energieversgg.N.Ö')
    ,(N'RÖTHLIN Wirtschaftsprü')
    ,(N'FAB Reno OÖ')
    ,(N'Geb Krankenk OÖ')
    ,(N'Geb Krankenk NÖ')
    ,(N'Hubertusbräu Johann Kü')
    ,(N'Softwaremanufaktur Grü')
    ,(N'GESPAG OÖ')
    ,(N'Fachhochschulstudiengä')
    ,(N'Fachhochschulstudiengä')
    ,(N'AS Handel EDV & Bauträ')
    ,(N'Lds-Schulrat NÖ')
    ,(N'IQSOFT Gesellschaft fü')
    ,(N'Superfund OÖ')
    ,(N'DPT. 781 0A  SO.Ü')
    ,(N'WIFI OÖ')
    ,(N'WIFI NÖ')
    ,(N'Wifi-OÖ')
    ,(N'INACTIVE/BOGAD HANS Jü')
    ,(N'Media Data OÖ')
    ,(N'Lexmark-Transit /IBM Ö')
    ,(N'INACTIVE/LEXMARK/IBM Ö')
    ,(N'Versicherungsverband Ö')
    ,(N'INACTIVE/LANZ DKFM. Gü')
    ,(N'Landesschulrat für NÖ')
    ,(N'Dvo OÖ')
    ,(N'INACTIVE/MüLLER MAG Gü')
    ,(N'Wirtschaftskammer NÖ')
    ,(N' Ä rztekammer NÖ')
    ,(N' Ä rztek. OÖ')
    ,(N'Brandverhütung OÖ')
    ,(N'Allianz Elementar NÖ')
    ,(N'Landesregierung NÖ')
    ,(N'Sparkasse NÖ')
    ,(N'Ka Arb Ang f NÖ')
    
    SELECT textstring FROM @test 
    WHERE NOT( textstring COLLATE Latin1_General_BIN2 LIKE '%[^' + char(32) + '-' + char(126) + ']');
    /*
    textstring
    Landesrechnungshof O
    ARBO
    */


    Kalman Toth Database & OLAP Architect sqlusa.com
    New Book / Kindle: Pass SQL Exam 70-461 & Job Interview: Programming SQL Server 2012


    • Edited by Kalman Toth Monday, May 20, 2013 12:51 PM
    • Marked as answer by Santhosh H Tuesday, May 21, 2013 6:28 AM
    Monday, May 20, 2013 12:50 PM