Fast Replace for Special Characters in Text

Answered Fast Replace for Special Characters in Text

  • Saturday, April 14, 2012 7:39 PM
    Moderator
     
      Has Code

    What is the fastest T-SQL ( not interested in CLR ) method of replacing special characters in string with * ? Thanks.

    Example - replace special characters (not Alphanumeric, not - , etc.. like ó) with * :

    SELECT LastName 
    FROM AdventureWorks.Person.Contact 
    WHERE LastName LIKE '%[^0-9a-zA-Z'' -]%' COLLATE SQL_Latin1_General_Cp850_BIN
    ORDER BY 1
    /*
    LastName
    Javier Castrejón
    Jiménez
    Jiménez
    Jiménez ....*/
    SELECT LastName 
    FROM AdventureWorks2008.Person.Person 
    WHERE LastName LIKE '%[^0-9a-zA-Z'' -]%' COLLATE SQL_Latin1_General_Cp850_BIN
    ORDER BY 1
    /*
    LastName
    Javier Castrejón
    Jiménez
    Jiménez
    Jiménez ....*/



    Kalman Toth SQL SERVER & BI TRAINING




All Replies

  • Saturday, April 14, 2012 8:01 PM
     
     Answered

    Fastest way would be with a CLR proc and regex.   But if you want to stay inside T-SQL then the Tally Table method proposed here looks as good as anything.

    http://ask.sqlservercentral.com/questions/6518/help-using-replace-with-wildcard-matching-pattern.html


    Chuck

  • Saturday, April 14, 2012 8:17 PM
     
     Answered Has Code

    I agree with Chuck - CLR/regex is the fastest and lightest way on CPU. Below is a really bad way of doing it (but it was kind of fun to do) in T-SQL.

    WITH CTE AS(
    	SELECT ROW_NUMBER() OVER (ORDER BY object_id) RowID 
    	FROM sys.columns
    	),
    CTE2 AS (
    SELECT DISTINCT LastName, C.RowID, 
    	CASE WHEN ASCII(X.chr) BETWEEN 65 AND 90 OR ASCII(X.chr) BETWEEN 97 AND 122 OR ASCII(X.chr) = 32 THEN X.chr ELSE '*' END FinalChar
    FROM AdventureWorks2008R2.Person.Person P
    CROSS APPLY (SELECT RowID FROM CTE WHERE RowID <= LEN(P.LastName)) C
    CROSS APPLY (SELECT SUBSTRING(P.LastName,C.RowID,1)) X (chr)
    WHERE P.LastName LIKE '%[^0-9a-zA-Z'' -]%' COLLATE SQL_Latin1_General_Cp850_BIN
    )
    SELECT DISTINCT REPLACE(String,'&#x20;',' ')
    FROM CTE2 C
    CROSS APPLY (SELECT FinalChar + ''
    			FROM CTE2 C2
    			WHERE C.LastName = C2.LastName
    			FOR XML PATH ('')) X (String)
    

    (Sorry I dodn't have the same AdventureWorks DB as you - but you get the idea).

    Simon

  • Sunday, April 15, 2012 2:47 PM
     
     Answered Has Code

    Below seems likely to be reasonably efficient. The extre recursive calls will only happen if there is more than one undesired character, and chances are more often than not, that won't happen too often.

    I decided against the collate sequence and just went with specifically listing the "allowed" characters rather than using a range, because the dash at the end added a degree of potential confusion, and also, at times simpler is better/less fancy is more.

    Setup:
    		Declare @TT Table (ID int identity, Lastname varchar(99))
    		Insert @TT Select 'Adams' 
    		 UNION ALL Select 'Baker' 
    		 UNION ALL Select 'Carlton$' 
    		 UNION ALL Select 'David$on' 
    		 UNION ALL Select 'E$w@rd%?n'
    		 UNION ALL Select '(Franklin)'
    		 UNION ALL Select '%%!@#$%^&*()_Gregory+_)(*&^%$#@!'
    
    QUERY:
    		;With CT1 (OLastName, LastName) as
    		(
    			SELECT LastName, Replace(LastName, SubString(LastName, PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastName), 1), '*')
    			  FROM @TT  Where  PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastName) >0  --Select * from CT2
    		)  
    		, CT2 (OLastName, Lastname, NextPos, PassNo)
    		  as
    		  (
    			Select OLastName, LastName,  PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastName) , 1
    			  From CT1  
    			Union All Select OLastName, Replace(LastName, SubString(LastName, PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastName), 1), '*')
    							, PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastName)  
    							, PassNo + 1
    					   From CT2 
    					  Inner Join (Select 1 as Stopper) DQ on PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastName) > 0
    		  )
    		  Select CT2.LastName, CT2.OLastName
    			From CT2
    		   Inner Join (Select OLastName, Max(PassNo) as MaxPassNo from CT2 group by OLastName) as CTMax
    						   on CTMax.MaxPassNo = Ct2.PassNo 
    						  and CTMAX.OLastName = CT2.OLastname  
    		 


  • Sunday, April 15, 2012 3:31 PM
     
      Has Code

    EDITED: See comments at end of this post. 

    Also, just to keep an open mind and use lateral thinking:  This is actually even faster.  (To be 100 percent reliable, I suppose repeat the outer applies the same number of times as the length of the lastname field).  But technically, this is faster (my actual execution plan for my example was 3.5 times faster, and the speed didn't really vary as I added more outer applies). 

    EXTREME_NESTING:
    		SELECT 'show',  LastName, lastnameCA11
    				, Replace(LastName
    				          , SubString(LastName, PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastName), 1), '*')
    				, *
    		  FROM @TT
    		Outer  Apply (Select Replace(LastName, SubString(LastName, PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastName), 1), '*') as LastNameCA1
    						, PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastName) as NextPosCa1) CA1
    		Outer  Apply (Select Replace(LastNameCA1, SubString(LastNameCA1, PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastNameCA1), 1), '*') as LastNameCA2
    						, PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastNameCA1) as NextPosCa2) CA2
    		Outer  Apply (Select Replace(LastNameCA2, SubString(LastNameCA2, PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastNameCA2), 1), '*') as LastNameCA3
    						, PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastNameCA2) as NextPosCa3) CA3
    		Outer  Apply (Select Replace(LastNameCA3, SubString(LastNameCA3, PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastNameCA3), 1), '*') as LastNameCA4
    						, PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastNameCA3) as NextPosCa4) CA4
    		Outer  Apply (Select Replace(LastNameCA4, SubString(LastNameCA4, PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastNameCA4), 1), '*') as LastNameCA5
    						, PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastNameCA4) as NextPosCa5) CA5
    		Outer  Apply (Select Replace(LastNameCA5, SubString(LastNameCA5, PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastNameCA5), 1), '*') as LastNameCA6
    						, PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastNameCA5) as NextPosCa6) CA6
    		Outer  Apply (Select Replace(LastNameCA6, SubString(LastNameCA6, PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastNameCA6), 1), '*') as LastNameCA7
    						, PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastNameCA6) as NextPosCa7) CA7
    		Outer  Apply (Select Replace(LastNameCA7, SubString(LastNameCA7, PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastNameCA7), 1), '*') as LastNameCA8
    						, PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastNameCA7) as NextPosCa8) CA8
    		Outer  Apply (Select Replace(LastNameCA8, SubString(LastNameCA8, PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastNameCA8), 1), '*') as LastNameCA9
    						, PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastNameCA8) as NextPosCa9) CA9
    		Outer  Apply (Select Replace(LastNameCA9, SubString(LastNameCA9, PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastNameCA9), 1), '*') as LastNameCA10
    						, PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastNameCA9) as NextPosCa10) CA10
    		Outer  Apply (Select Replace(LastNameCA10, SubString(LastNameCA10, PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastNameCA10), 1), '*') as LastNameCA11
    						, PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastNameCA10) as NextPosCa8) CA11

    EDIT/DISCLAIMER: <strike>I think</strike> I'm going to have to retract this response (the multiple nested outer applies) until further research.  At home on SQL 2012 Express, it performs very unpredictably.  It may be a memory hog even if the execution plan looks good, but I'm not quite sure yet:  I'm still not 100 percent convinced there's not some other issue with my home SQL Express installation, but I don't know.  So for now, I'm leaving this post up, but adding this disclaimer until further notice.  The CTE version (my previous reply) still seems to be fine.

    EDIT part II: Definitely retracting.  Interestingly, the estimated AND actual plans for this version indicate a much faster execution time, but that's only part of the story: In actually running this query, the extra outer-applies slow things down with each layer.  In my haste, I had posted this reply just looking at the actual plan (because the whole query ran slower anyway, trying to generate the actual plan), but I hadn't tested just running the query by itself, and that scenario was slower.  I've since tweaked and sped things up a bit by checking for 0 in previous outer apply results, but it's still slower than the original CTE, so until or unless it gets faster, it's staying retracted.

    I think the subject matter is interesting enough to leave the post up, though. 

    • Edited by johnqflorida Monday, April 16, 2012 11:51 AM Retracted this post
    •  
  • Monday, April 16, 2012 11:44 AM
     
     
    Please see the EDIT comments in the reply for "Extreme Nesting". 
  • Friday, April 20, 2012 6:50 PM
    Moderator
     
     
    Thanks everybody for the contributions.

    Kalman Toth SQL SERVER & BI TRAINING