locked
SQL script that would identify duplicates and skip in a row RRS feed

  • Question

  • Hi All,

      I hope everyone is safe . . .

    I have kind of weird question. . . Please look at the below two tables

    Table 1

    ID

    Name

    Email1

    Email2

    1

    TEST

    Test1@test.com

    Test2@test.com,test3@test.com,

    2

    TEST2

    Test4@test.com

    Test2@test.com,test3@test.com, Test4@test.com



    When concatenate Email1 and Email2

    ID

    Name

    New Email1s

    1

    TEST

    Test2@test.com,test3@test.com,

    2

    TEST2

    Test4@test.com, Test2@test.com,test3@test.com, Test4@test.com


    So my question is - how can i avoid the highlighted dups when doing the concatination?

    Thanks

    Kiru

    Thursday, August 13, 2020 2:50 PM

Answers

  • Hi Kiruman,

    Please try the following approach.

    XQuery distinct-values() function does exactly what you need.

    -- DDL and sample data population, start
    DECLARE @tbl TABLE (ID INT IDENTITY PRIMARY KEY, email1 VARCHAR(1024), email2 VARCHAR(1024));
    INSERT INTO @tbl (email1, email2) VALUES
    ('Test1@Test.com', 'Test2@test.com, Test3@test.com'),
    ('Test4@Test.com', 'Test2@test.com, Test3@test.com,Test4@Test.com');
    -- DDL and sample data population, end
    
    DECLARE @separator CHAR(1) = ',';
    , SPACE(1),'')
    SELECT ID, Try_CAST('<root><r><![CDATA[' + 
    	REPLACE(REPLACE(LOWER(CONCAT(email1, @separator, email2)), SPACE(1),''), @separator, ']]></r><r><![CDATA[') + 
            ']]></r></root>' AS XML)
    		.query('for $i in distinct-values(/root/r/text())
    			return if ($i eq (distinct-values(/root/r/text())[last()])[1]) then $i
    				else concat($i, sql:variable("@separator"))
    		').value('.', 'NVARCHAR(MAX)') AS emailList
    FROM @tbl;

    Output
    ID emailList
    1 Test1@Test.com, Test2@test.com, Test3@test.com
    2 Test4@Test.com, Test2@test.com, Test3@test.com




    • Edited by Yitzhak Khabinsky Thursday, August 13, 2020 5:36 PM
    • Marked as answer by kiruman Thursday, August 13, 2020 6:41 PM
    Thursday, August 13, 2020 4:41 PM
  • This looks like exactly what i want . . . let me apply it in the original script that i was working and will get back to you.

    Thanks much!!

    Hi kiruman,

    Looking forward for your feedback on my solution.

    • Marked as answer by kiruman Thursday, August 13, 2020 6:43 PM
    Thursday, August 13, 2020 6:41 PM

All replies

  • Your question is not very clear. What is the exact formula you're using for New Emails column? Can both of the Email1 and Email2 columns contain multiple emails? 

    It sounds like you may want to split both columns first and then do aggregate distinct values after you first splitted both columns.


    Looking for new opportunities

    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog


    My TechNet articles

    Thursday, August 13, 2020 3:18 PM
  • If Email1 is a single address and STRING_SPLIT is not available in your environment, then consider this example too:

    declare @mytable table ( ID int, Name varchar(30), EMail1 varchar(30), EMail2 varchar(max) )
    
    insert into @mytable values
    ( 1, 'TEST', 'Test1@test.com', 'Test2@test.com,test3@test.com' ),
    ( 2, 'TEST2', 'Test4@test.com', 'Test2@test.com,test3@test.com, Test4@test.com' )
    
    select ID, Name, 
    	case when 
    		EMail2 = EMail1  or
    		EMail2 LIKE EMail1 + '[ ,]%' or
    		EMail2 LIKE '%[ ,]' + EMail1 + '[ ,]%' or
    		EMail2 LIKE '%[ ,]' + EMail1
    		then EMail2
    		else EMail2 + ', ' + EMail1 
    	end as NewEmails
    from @mytable


    (Needs some adjustments if addresses contain special characters: ‘_’, '%', etc.).



    • Edited by Viorel_MVP Thursday, August 13, 2020 5:25 PM
    Thursday, August 13, 2020 4:03 PM
  • Hi Kiruman,

    What is your SQL Server version?

    SELECT @@VERSION;


    Thursday, August 13, 2020 4:21 PM
  • Hi Kiruman,

    Please try the following approach.

    XQuery distinct-values() function does exactly what you need.

    -- DDL and sample data population, start
    DECLARE @tbl TABLE (ID INT IDENTITY PRIMARY KEY, email1 VARCHAR(1024), email2 VARCHAR(1024));
    INSERT INTO @tbl (email1, email2) VALUES
    ('Test1@Test.com', 'Test2@test.com, Test3@test.com'),
    ('Test4@Test.com', 'Test2@test.com, Test3@test.com,Test4@Test.com');
    -- DDL and sample data population, end
    
    DECLARE @separator CHAR(1) = ',';
    , SPACE(1),'')
    SELECT ID, Try_CAST('<root><r><![CDATA[' + 
    	REPLACE(REPLACE(LOWER(CONCAT(email1, @separator, email2)), SPACE(1),''), @separator, ']]></r><r><![CDATA[') + 
            ']]></r></root>' AS XML)
    		.query('for $i in distinct-values(/root/r/text())
    			return if ($i eq (distinct-values(/root/r/text())[last()])[1]) then $i
    				else concat($i, sql:variable("@separator"))
    		').value('.', 'NVARCHAR(MAX)') AS emailList
    FROM @tbl;

    Output
    ID emailList
    1 Test1@Test.com, Test2@test.com, Test3@test.com
    2 Test4@Test.com, Test2@test.com, Test3@test.com




    • Edited by Yitzhak Khabinsky Thursday, August 13, 2020 5:36 PM
    • Marked as answer by kiruman Thursday, August 13, 2020 6:41 PM
    Thursday, August 13, 2020 4:41 PM
  • Hi Yitzhak,

    What needs to change in your script to be case insensitive? I added test3@test.com into the second row first email and the result was not correct.


    Looking for new opportunities

    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog


    My TechNet articles

    Thursday, August 13, 2020 4:47 PM
  • Hi Naomi,

    I updated the answer to accommodate case insensitivity.

    Thursday, August 13, 2020 5:15 PM
  • I thought about it, but this is not desired - e.g. we don't want to actually convert the case of our emails, we just want the distinct check to be case insensitive.

    Looking for new opportunities

    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog


    My TechNet articles

    Thursday, August 13, 2020 5:18 PM
  • Hi Naomi,

    It is a norm for emails to be in lower case.

    Let's wait for the original poster feedback.

    Thursday, August 13, 2020 5:28 PM
  • Ha-ha :)

    I guess the issue with the XML approach that it's very powerful, but it's case-sensitive and you can not really make it case-insensitive without lots of tricks.


    Looking for new opportunities

    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog


    My TechNet articles

    Thursday, August 13, 2020 6:02 PM
  • Nothing complicated . . . am just trying to combine email1 and email2 by using concatenation and while doing that i want to make sure that i dont have duplicated emails in the new concatenated filed.

    Thank you.

      
    Thursday, August 13, 2020 6:04 PM
  • Its standard SQL 2016.
    Thursday, August 13, 2020 6:04 PM
  • DECLARE @Table1 TABLE (
    	[ID] int,
    	[Name] varchar(50),
    	[Email1] varchar(100),
    	[Email2] varchar(1000)
    );
    
    INSERT INTO @Table1 VALUES
    (1, 'Test', 'Test1@test.com', 'Test2@test.com,test3@test.com'),
    (2, 'Test2', 'Test4@test.com', 'Test2@test.com,test3@test.com,Test4@test.com');
    
    SELECT [ID], [Name], CASE WHEN CHARINDEX([Email1], [Email2]) = 0 THEN CONCAT([Email1], ',', [Email2]) ELSE [Email2] END AS [NewEmail]
    FROM @Table1;


    A Fan of SSIS, SSRS and SSAS

    Thursday, August 13, 2020 6:06 PM
  • The question is - can the email1 also contain multiple emails? If not and it's always a single email, then I would go with _Viorel's solution (this is what I had in mind when I just read your original question), but if email1 can have multiple emails, then that solution is not going to work and we need to split values first in both columns.

    Looking for new opportunities

    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog


    My TechNet articles

    Thursday, August 13, 2020 6:07 PM
  • I really do not care about the case sensitivity at all. 
    Thursday, August 13, 2020 6:11 PM
  • This looks like exactly what i want . . . let me apply it in the original script that i was working and will get back to you.

    Thanks much!!

    Thursday, August 13, 2020 6:12 PM
  • BTW, @Viorel_ - I'm thinking it also may not work exactly since we can have

    test@test.com

    and subtest@test.com and your code will consider them duplicates. The clean way is to split and combine back.


    Looking for new opportunities

    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog


    My TechNet articles

    Thursday, August 13, 2020 6:12 PM
  • BTW, @Viorel_ - I'm thinking it also may not work exactly since we can have

    test@test.com

    and subtest@test.com and your code will consider them duplicates. The clean way is to split and combine back.


    Looking for new opportunities

    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog


    My TechNet articles

    I think that such cases are covered, and it will work as expected.


    Thursday, August 13, 2020 6:19 PM
  • This looks like exactly what i want . . . let me apply it in the original script that i was working and will get back to you.

    Thanks much!!

    Hi kiruman,

    Looking forward for your feedback on my solution.

    • Marked as answer by kiruman Thursday, August 13, 2020 6:43 PM
    Thursday, August 13, 2020 6:41 PM
  • Worked out perfect!!
    Thursday, August 13, 2020 6:43 PM
  • Ha-ha :)

    I guess the issue with the XML approach that it's very powerful, but it's case-sensitive and you can not really make it case-insensitive without lots of tricks.


    Looking for new opportunities

    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog


    My TechNet articles

    Hi Naomi,

    You are correct with that statement "...XML approach that it's very powerful...".

    There is a certain SQL Server XQuery deficiency in putting a function in the XPath expression like this:

    /root/r/lower-case(text())

    That why we need more votes here: SQL Server vNext (post 2019) and NoSQL functionality

    so Microsoft will implement that functionality.

    Thursday, August 13, 2020 6:49 PM