none
Duplicate Across Multiple Tables

    Question

  • I need help in planning the best course of action to find duplicates from multiple tables.

         
        

    DECLARE @Table1 TABLE (ID_T1 int, Col1 varchar(10), C2 varchar(10), C3 varchar(10), C4 varchar(10), Col5 varchar(10), Col6 varchar(10))
        DECLARE @Table2 TABLE (ID_T2 int, Col1 varchar(10), C2 varchar(10), C3 varchar(10), C4 varchar(10), Col5 varchar(10), Col6 varchar(10))
        

    INSERT INTO @Table1 (ID_T1, Col1, C2, C3, C4, Col5, Col6) SELECT 1, 'One', 'Test1', 'Line1', 'Record1', 'OTLR1', 'RLTO1' UNION ALL SELECT 2, 'Two', 'Test2', 'Line2', 'Record2', 'OTLR2', 'RLTO2' UNION ALL SELECT 3, 'Three', 'Test3', 'Line3', 'Record3', 'OTLR3', 'RLTO3' UNION ALL SELECT 4, 'Four', 'Test4', 'Line4', 'Record4', 'OTLR4', 'RLTO4' UNION ALL SELECT 5, 'Five', 'Test5', 'Line5', 'Record5', 'OTLR5', 'RLTO5' UNION ALL SELECT 6, 'Six', 'Test6', 'Line6', 'Record6', 'OTLR6', 'RLTO6' UNION ALL SELECT 7, 'Seven', 'Test6', 'Line6', 'Record6', 'OTLR7', 'RLTO7' UNION ALL SELECT 8, 'Eight', 'Test8', 'Line8', 'Record8', 'OTLR8', 'RLTO8' INSERT INTO @Table2 (ID_T2, Col1, C2, C3, C4, Col5, Col6) SELECT 10, 'Ten', 'Test1', 'Line1', 'Record1', 'OTLR10', 'RLTO10' UNION ALL SELECT 20, 'Twenty', 'Test2', 'Line2', 'Record2', 'OTLR20', 'RLTO20' UNION ALL SELECT 30, 'Thirty', 'Test3', 'Line3', 'Record3', 'OTLR30', 'RLTO30' UNION ALL SELECT 40, 'Forty', 'Test4', 'Line4', 'Record4', 'OTLR40', 'RLTO40' UNION ALL SELECT 50, 'Fifty', 'Test5', 'Line5', 'Record5', 'OTLR50', 'RLTO50' UNION ALL SELECT 80, 'Eighty', 'Test80', 'Line80', 'Record80', 'OTLR80', 'RLTO80' UNION ALL SELECT 90, 'Ninety', 'Test90', 'Line90', 'Record90', 'OTLR90', 'RLTO90' SELECT * FROM @Table1 SELECT * FROM @Table2



    Now, C2, C3, and C4 can have either unique or duplicate values in Table1 and Table2. 

    I am trying to get three outputs.
    Output 1 will have ONLY the records from Table 1 which has same C2, C3, and C4 columns' values in table 2 with Duplicate marked as 1/0 in Duplicate_SameTable

    Output 2 will have ONLY the records from Table 1 which has same C2, C3, and C4 columns' values in table 2 with Duplicate marked as 1/0 in Duplicate_PrimaryTable

    Output 3 will have the records from Table 1 and Tabl2 2 which has same C2, C3, and C4 columns' values with Duplicate marked as 1/0 in Duplicate_BothTables.

    I can get the Output 1 from the following query.

        SELECT *, CASE
        			WHEN COUNT(*) OVER (PARTITION BY  C2, C3, C4) > 1 THEN 1
        			ELSE 0
        		 END AS Duplicate_SameTable
        FROM @Table1
        ORDER BY ID_T1 ASC

    Output 2 

        SELECT B.ID, B.Col1, B.C2, B.C3, B.C4, B.Col5, B.Col6, CASE WHEN C.Duplicate_SameTable = 1 THEN 0 ELSE B.Duplicate_BothTables END AS Duplicate_PrimaryTable
        FROM (
        SELECT ID, Col1, C2, C3, C4, Col5, Col6, CASE
                    WHEN COUNT(*) OVER (PARTITION BY  C2, C3, C4) > 1 THEN 1
                    ELSE 0
                 END AS Duplicate_BothTables FROM ( 
        SELECT ID_T1 AS ID, Col1, C2, C3, C4, Col5, Col6 FROM @Table1
        UNION
        SELECT ID_T2 AS ID, Col1, C2, C3, C4, Col5, Col6 FROM @Table2) A
        ) B INNER JOIN (SELECT *, CASE
                    WHEN COUNT(*) OVER (PARTITION BY  C2, C3, C4) > 1 THEN 1
                    ELSE 0
                 END AS Duplicate_SameTable
        FROM @Table1) C ON B.ID = C.ID_T1



    Output 3

        SELECT B.ID, B.Col1, B.C2, B.C3, B.C4, B.Col5, B.Col6, CASE WHEN C.Duplicate_SameTable = 1 THEN 0 ELSE B.Duplicate_BothTables END AS Duplicate_PrimaryTable
        FROM (
        SELECT ID, Col1, C2, C3, C4, Col5, Col6, CASE
                    WHEN COUNT(*) OVER (PARTITION BY  C2, C3, C4) > 1 THEN 1
                    ELSE 0
                 END AS Duplicate_BothTables FROM ( 
        SELECT ID_T1 AS ID, Col1, C2, C3, C4, Col5, Col6 FROM @Table1
        UNION
        SELECT ID_T2 AS ID, Col1, C2, C3, C4, Col5, Col6 FROM @Table2) A
        ) B LEFT JOIN (SELECT *, CASE
                    WHEN COUNT(*) OVER (PARTITION BY  C2, C3, C4) > 1 THEN 1
                    ELSE 0
                 END AS Duplicate_SameTable
        FROM @Table1) C ON B.ID = C.ID_T1
        ORDER BY B.ID


    I am wondering how to go about getting Output 2 and Output 3  more efficiently. 

    One way I can think of doing is Union Table 1 and Table 2 and then run above query. Or is there a better way of doing this as real tables will have millions of records and doing UNION and then applying above query might take much longer.

    Thank you

    • Edited by Ahungry Boi Wednesday, September 04, 2013 4:30 PM
    Wednesday, September 04, 2013 4:15 PM

Answers

  • Sorry, cut and paste error.  It should be

    --Query 2
    Select ID_T1 AS ID, Col1, C2, C3, C4, Col5, Col6, 
      Case When Exists(Select * From @Table2 t2 Where t1.C2 = t2.C2 And t1.C3 = t2.C3 And t1.C4 = t2.C4) Then 1 Else 0 End As Duplicate_PrimaryTable
    From @Table1 t1;
    --Query 3
    Select ID_T1 AS ID, Col1, C2, C3, C4, Col5, Col6, 
      Case When Exists(Select * From @Table2 t2 Where t1.C2 = t2.C2 And t1.C3 = t2.C3 And t1.C4 = t2.C4) Then 1 Else 0 End As Duplicate_PrimaryTable
    From @Table1 t1
    Union All
    Select ID_T2 AS ID, Col1, C2, C3, C4, Col5, Col6, 
      Case When Exists(Select * From @Table1 t1 Where t1.C2 = t2.C2 And t1.C3 = t2.C3 And t1.C4 = t2.C4) Then 1 Else 0 End As Duplicate_BothTables
    From @Table2 t2;

    Tom

    • Marked as answer by Ahungry Boi Thursday, September 05, 2013 2:35 PM
    Wednesday, September 04, 2013 11:07 PM

All replies

  • --Query 2
    Select ID_T1 AS ID, Col1, C2, C3, C4, Col5, Col6, 
      Case When Exists(Select * From @Table2 t2 Where t1.C2 = t2.C2 And t1.C3 = t2.C3 And t1.C4 = t2.C4) Then 1 Else 0 End As Duplicate_PrimaryTable
    From @Table1 t1;
    --Query 3
    Select ID_T1 AS ID, Col1, C2, C3, C4, Col5, Col6, 
      Case When Exists(Select * From @Table2 t2 Where t1.C2 = t2.C2 And t1.C3 = t2.C3 And t1.C4 = t2.C4) Then 1 Else 0 End As Duplicate_PrimaryTable
    From @Table1 t1;

    Having the correct indexes will be very important to performance.  You may have to experiment with the best choices, but my starting recommendation would be to have a nonclustered index on C2,C3,C4 on both tables.

    Tom

    Wednesday, September 04, 2013 4:52 PM
  • Hi Tom,

    Your queries look much nicer than mine. I will keep that indexing in mind (although a query where I won't need indexes would be great as a lot of insert/update will happen based on the output and having indexes will reduce insert/update performance.)

    But as for the above queries, am I missing something or Query 2 and 3 are the same?

    Thank you

    Wednesday, September 04, 2013 6:34 PM
  • Sorry, cut and paste error.  It should be

    --Query 2
    Select ID_T1 AS ID, Col1, C2, C3, C4, Col5, Col6, 
      Case When Exists(Select * From @Table2 t2 Where t1.C2 = t2.C2 And t1.C3 = t2.C3 And t1.C4 = t2.C4) Then 1 Else 0 End As Duplicate_PrimaryTable
    From @Table1 t1;
    --Query 3
    Select ID_T1 AS ID, Col1, C2, C3, C4, Col5, Col6, 
      Case When Exists(Select * From @Table2 t2 Where t1.C2 = t2.C2 And t1.C3 = t2.C3 And t1.C4 = t2.C4) Then 1 Else 0 End As Duplicate_PrimaryTable
    From @Table1 t1
    Union All
    Select ID_T2 AS ID, Col1, C2, C3, C4, Col5, Col6, 
      Case When Exists(Select * From @Table1 t1 Where t1.C2 = t2.C2 And t1.C3 = t2.C3 And t1.C4 = t2.C4) Then 1 Else 0 End As Duplicate_BothTables
    From @Table2 t2;

    Tom

    • Marked as answer by Ahungry Boi Thursday, September 05, 2013 2:35 PM
    Wednesday, September 04, 2013 11:07 PM