Fast Replace for Special Characters in Text
-
samedi 14 avril 2012 19:39Modérateur
What is the fastest T-SQL ( not interested in CLR ) method of replacing special characters in string with * ? Thanks.
Example - replace special characters (not Alphanumeric, not - , etc.. like ó) with * :
SELECT LastName FROM AdventureWorks.Person.Contact WHERE LastName LIKE '%[^0-9a-zA-Z'' -]%' COLLATE SQL_Latin1_General_Cp850_BIN ORDER BY 1 /* LastName Javier Castrejón Jiménez Jiménez Jiménez ....*/
SELECT LastName FROM AdventureWorks2008.Person.Person WHERE LastName LIKE '%[^0-9a-zA-Z'' -]%' COLLATE SQL_Latin1_General_Cp850_BIN ORDER BY 1 /* LastName Javier Castrejón Jiménez Jiménez Jiménez ....*/
Kalman Toth SQL SERVER & BI TRAINING
- Modifié Kalman TothMicrosoft Community Contributor, Moderator dimanche 15 avril 2012 05:00
Toutes les réponses
-
samedi 14 avril 2012 20:01
Fastest way would be with a CLR proc and regex. But if you want to stay inside T-SQL then the Tally Table method proposed here looks as good as anything.
Chuck
- Marqué comme réponse Kalman TothMicrosoft Community Contributor, Moderator vendredi 20 avril 2012 18:49
-
samedi 14 avril 2012 20:17
I agree with Chuck - CLR/regex is the fastest and lightest way on CPU. Below is a really bad way of doing it (but it was kind of fun to do) in T-SQL.
WITH CTE AS( SELECT ROW_NUMBER() OVER (ORDER BY object_id) RowID FROM sys.columns ), CTE2 AS ( SELECT DISTINCT LastName, C.RowID, CASE WHEN ASCII(X.chr) BETWEEN 65 AND 90 OR ASCII(X.chr) BETWEEN 97 AND 122 OR ASCII(X.chr) = 32 THEN X.chr ELSE '*' END FinalChar FROM AdventureWorks2008R2.Person.Person P CROSS APPLY (SELECT RowID FROM CTE WHERE RowID <= LEN(P.LastName)) C CROSS APPLY (SELECT SUBSTRING(P.LastName,C.RowID,1)) X (chr) WHERE P.LastName LIKE '%[^0-9a-zA-Z'' -]%' COLLATE SQL_Latin1_General_Cp850_BIN ) SELECT DISTINCT REPLACE(String,' ',' ') FROM CTE2 C CROSS APPLY (SELECT FinalChar + '' FROM CTE2 C2 WHERE C.LastName = C2.LastName FOR XML PATH ('')) X (String)(Sorry I dodn't have the same AdventureWorks DB as you - but you get the idea).
Simon
- Marqué comme réponse Kalman TothMicrosoft Community Contributor, Moderator vendredi 20 avril 2012 18:47
-
dimanche 15 avril 2012 14:47
Below seems likely to be reasonably efficient. The extre recursive calls will only happen if there is more than one undesired character, and chances are more often than not, that won't happen too often.
I decided against the collate sequence and just went with specifically listing the "allowed" characters rather than using a range, because the dash at the end added a degree of potential confusion, and also, at times simpler is better/less fancy is more.
Setup: Declare @TT Table (ID int identity, Lastname varchar(99)) Insert @TT Select 'Adams' UNION ALL Select 'Baker' UNION ALL Select 'Carlton$' UNION ALL Select 'David$on' UNION ALL Select 'E$w@rd%?n' UNION ALL Select '(Franklin)' UNION ALL Select '%%!@#$%^&*()_Gregory+_)(*&^%$#@!' QUERY: ;With CT1 (OLastName, LastName) as ( SELECT LastName, Replace(LastName, SubString(LastName, PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastName), 1), '*') FROM @TT Where PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastName) >0 --Select * from CT2 ) , CT2 (OLastName, Lastname, NextPos, PassNo) as ( Select OLastName, LastName, PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastName) , 1 From CT1 Union All Select OLastName, Replace(LastName, SubString(LastName, PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastName), 1), '*') , PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastName) , PassNo + 1 From CT2 Inner Join (Select 1 as Stopper) DQ on PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastName) > 0 ) Select CT2.LastName, CT2.OLastName From CT2 Inner Join (Select OLastName, Max(PassNo) as MaxPassNo from CT2 group by OLastName) as CTMax on CTMax.MaxPassNo = Ct2.PassNo and CTMAX.OLastName = CT2.OLastname- Modifié johnqflorida dimanche 15 avril 2012 14:53 Removed unneeded comment
- Marqué comme réponse Kalman TothMicrosoft Community Contributor, Moderator vendredi 20 avril 2012 18:46
-
dimanche 15 avril 2012 15:31
EDITED: See comments at end of this post.
Also, just to keep an open mind and use lateral thinking: This is actually even faster. (To be 100 percent reliable, I suppose repeat the outer applies the same number of times as the length of the lastname field). But technically, this is faster (my actual execution plan for my example was 3.5 times faster, and the speed didn't really vary as I added more outer applies).
EXTREME_NESTING: SELECT 'show', LastName, lastnameCA11 , Replace(LastName , SubString(LastName, PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastName), 1), '*') , * FROM @TT Outer Apply (Select Replace(LastName, SubString(LastName, PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastName), 1), '*') as LastNameCA1 , PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastName) as NextPosCa1) CA1 Outer Apply (Select Replace(LastNameCA1, SubString(LastNameCA1, PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastNameCA1), 1), '*') as LastNameCA2 , PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastNameCA1) as NextPosCa2) CA2 Outer Apply (Select Replace(LastNameCA2, SubString(LastNameCA2, PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastNameCA2), 1), '*') as LastNameCA3 , PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastNameCA2) as NextPosCa3) CA3 Outer Apply (Select Replace(LastNameCA3, SubString(LastNameCA3, PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastNameCA3), 1), '*') as LastNameCA4 , PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastNameCA3) as NextPosCa4) CA4 Outer Apply (Select Replace(LastNameCA4, SubString(LastNameCA4, PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastNameCA4), 1), '*') as LastNameCA5 , PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastNameCA4) as NextPosCa5) CA5 Outer Apply (Select Replace(LastNameCA5, SubString(LastNameCA5, PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastNameCA5), 1), '*') as LastNameCA6 , PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastNameCA5) as NextPosCa6) CA6 Outer Apply (Select Replace(LastNameCA6, SubString(LastNameCA6, PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastNameCA6), 1), '*') as LastNameCA7 , PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastNameCA6) as NextPosCa7) CA7 Outer Apply (Select Replace(LastNameCA7, SubString(LastNameCA7, PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastNameCA7), 1), '*') as LastNameCA8 , PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastNameCA7) as NextPosCa8) CA8 Outer Apply (Select Replace(LastNameCA8, SubString(LastNameCA8, PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastNameCA8), 1), '*') as LastNameCA9 , PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastNameCA8) as NextPosCa9) CA9 Outer Apply (Select Replace(LastNameCA9, SubString(LastNameCA9, PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastNameCA9), 1), '*') as LastNameCA10 , PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastNameCA9) as NextPosCa10) CA10 Outer Apply (Select Replace(LastNameCA10, SubString(LastNameCA10, PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastNameCA10), 1), '*') as LastNameCA11 , PatIndex('%[^-01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'' *]%', LastNameCA10) as NextPosCa8) CA11EDIT/DISCLAIMER: <strike>I think</strike> I'm going to have to retract this response (the multiple nested outer applies) until further research. At home on SQL 2012 Express, it performs very unpredictably. It may be a memory hog even if the execution plan looks good, but I'm not quite sure yet: I'm still not 100 percent convinced there's not some other issue with my home SQL Express installation, but I don't know. So for now, I'm leaving this post up, but adding this disclaimer until further notice. The CTE version (my previous reply) still seems to be fine.
EDIT part II: Definitely retracting. Interestingly, the estimated AND actual plans for this version indicate a much faster execution time, but that's only part of the story: In actually running this query, the extra outer-applies slow things down with each layer. In my haste, I had posted this reply just looking at the actual plan (because the whole query ran slower anyway, trying to generate the actual plan), but I hadn't tested just running the query by itself, and that scenario was slower. I've since tweaked and sped things up a bit by checking for 0 in previous outer apply results, but it's still slower than the original CTE, so until or unless it gets faster, it's staying retracted.
I think the subject matter is interesting enough to leave the post up, though.
- Modifié johnqflorida lundi 16 avril 2012 11:51 Retracted this post
-
lundi 16 avril 2012 11:44Please see the EDIT comments in the reply for "Extreme Nesting".
-
vendredi 20 avril 2012 18:50ModérateurThanks everybody for the contributions.
Kalman Toth SQL SERVER & BI TRAINING

