none
Duplicate row entries in SQL database

    Question

  • Hi All,

    Can anbody please help me with my doubt, I have been working on this for about weeks..

    below given is SELECT statement of my database table products_colors

    SELECT brand_id,product_id,product_color_id,product_n,color_code_n,color_name,image_n,pic1,pic2,pic3,pic4,pic5,offer_clicks,changed_at,most_popular_i,clearance_i,auto_add,product_upc_code,special_order_i,

    QtyLocation1,QtyLocation2,QtyLocation3,QtyLocation4,QtyLocation5,QtyLocation6,QtyLocation7,QtyLocation8,QtyLocation9,QtyLocation10,Username,wholesale_price,my_cost_price,counter_offer_price,temp_status,

    iudate FROM products_colors

    here, in this table there are total 42,000 rows and it has more than 1000 of duplicate entries, this duplicate entries can be grouped by product_n and color_code_n table, identity key column is product_color_id

    Now I want to write a query in such a  way that if (COUNT(product_n))>1 AND (COUNT(color_code_n))>1 than it will do SUM(QtyLocation1),SUM(QtyLocation2),SUM(QtyLocation3),SUM(QtyLocation4),SUM(QtyLocation5),SUM(QtyLocation6),SUM(QtyLocation7),SUM(QtyLocation8),SUM(QtyLocation9),SUM(QtyLocation10) and will delete the duplicate entries.

     

    any suggestion on this please help me out ... will appreciate your response!!

     

    Thank You,

     

    Tuesday, March 23, 2010 2:14 AM

Answers

  • hi,

    your design seems to suffer from very unnormalized design flaws..

    assuming you can effectively identify the row you want to update with the aggregation of all SUM(QtyLocationX), you actually can, but please remember that you will loose the other attributes values, say [picX], [offer_click], [changed_at] and the like.. this is becouse only 1 row with [product_n] = 'abc' will be later available..

    going on, you can start aggregating your data and populating a temp table you will later use for the updates..

    having that aggregated data in the temp table, you have now to identify which row, for each product_n group, you want to preserve.. in the following trivial sample I used the row with the MINimal [id] between each group.. having that row identified, you can update it with the aggregated data from the temp table, and finally drop that temp table..

    SET NOCOUNT ON;
    USE tempdb;
    GO
    CREATE TABLE dbo.t (
     Id int NOT NULL,
     product_n int NOT NULL,
     QtyLocation1 int,
     Other int
     
     );
    GO
    INSERT INTO dbo.t VALUES (1, 1, 10, 9);
    INSERT INTO dbo.t VALUES (2, 2, 10, 8);
    INSERT INTO dbo.t VALUES (3, 2, 10, 7);
    INSERT INTO dbo.t VALUES (4, 3, 10, 6);
    INSERT INTO dbo.t VALUES (5, 3, 10, 5);
    INSERT INTO dbo.t VALUES (6, 3, 10, 4);
    INSERT INTO dbo.t VALUES (7, 4, 10, 3);
    GO
    PRINT 'these are the ''duplicated rows'', where [product_n] is duplicated';
    PRINT 'but not other parts of the attribute..';
    SELECT t.product_n, COUNT(*) AS [Count],
     SUM(t.QtyLocation1) AS [Sum]
     FROM dbo.t
     GROUP BY t.product_n
     HAVING COUNT(*) > 1;
    GO
    PRINT 'we are working in the sand castle of a transaction';
    PRINT 'in order to keep the whole work atomic..';
    BEGIN TRAN;
    PRINT 'populating a temp table with all your sums..';
    SELECT t.product_n, COUNT(*) AS [Count],
     SUM(t.QtyLocation1) AS [Sum]
     INTO #temp
     FROM dbo.t
     GROUP BY t.product_n
     HAVING COUNT(*) > 1;

    PRINT 'these are the rows that will be updated with the sums..';
    PRINT 'we are assuming that, for each product_n group, we will';
    PRINT 'update (with the SUM..) and preserve the row with the';
    PRINT 'MIN [id] for each product_n group';
    SELECT *
     FROM #temp
      JOIN dbo.t ON #temp.product_n = t.product_n
      WHERE  t.Id = (SELECT MIN(t2.Id)
           FROM dbo.t t2
           WHERE t2.product_n = t.product_n
       )

    PRINT 'updating..';
    UPDATE dbo.t
     SET QtyLocation1 = #temp.[Sum]
     FROM dbo.t t1
      JOIN #temp ON #temp.product_n = t1.product_n
      WHERE  t1.Id = (SELECT MIN(t2.Id)
           FROM dbo.t t2
           WHERE t2.product_n = t1.product_n
       );

    PRINT 'deleting ''duplicated rows''.. meaning that all the rows';
    PRINT 'with an [id] > than the MIN(id) for each product_n group';
    PRINT 'will be deleted..';
    DELETE
     FROM dbo.t
     WHERE Id IN
    (SELECT t1.Id
     FROM dbo.t t1
      JOIN #temp ON #temp.product_n = t1.product_n
     WHERE  t1.Id > (SELECT MIN(t2.Id)
           FROM dbo.t t2
           WHERE t2.product_n = t1.product_n
       )
     );

    PRINT 'and the result is..';
    SELECT *
     FROM dbo.t;
     
    PRINT 'dropping the temp table..';
    DROP TABLE #temp;
    COMMIT;
    GO
    DROP TABLE dbo.t;
    --<-----------------
    these are the 'duplicated rows', where [product_n] is duplicated
    but not other parts of the attribute..
    product_n   Count       Sum
    ----------- ----------- -----------
    2           2           20
    3           3           30

    we are working in the sand castle of a transaction
    in order to keep the whole work atomic..
    populating a temp table with all your sums..
    these are the rows that will be updated with the sums..
    we are assuming that, for each product_n group, we will
    update (with the SUM..) and preserve the row with the
    MIN [id] for each product_n group
    product_n   Count       Sum         Id          product_n   QtyLocation1 Other
    ----------- ----------- ----------- ----------- ----------- ------------ -----------
    2           2           20          2           2           10           8
    3           3           30          4           3           10           6

    updating..
    deleting 'duplicated rows'.. meaning that all the rows
    with an [id] > than the MIN(id) for each product_n group
    will be deleted..
    and the result is..
    Id          product_n   QtyLocation1 Other
    ----------- ----------- ------------ -----------
    1           1           10           9
    2           2           20           8
    4           3           30           6
    7           4           10           3

    dropping the temp table..

    please test your solution in a repro scenario :)

    regards


    http://www.asql.biz - DbaMgr2k - DbaMgr and further SQL Tools
    Tuesday, March 23, 2010 11:14 PM

All replies

  • hi,

    your design seems to suffer from very unnormalized design flaws..

    assuming you can effectively identify the row you want to update with the aggregation of all SUM(QtyLocationX), you actually can, but please remember that you will loose the other attributes values, say [picX], [offer_click], [changed_at] and the like.. this is becouse only 1 row with [product_n] = 'abc' will be later available..

    going on, you can start aggregating your data and populating a temp table you will later use for the updates..

    having that aggregated data in the temp table, you have now to identify which row, for each product_n group, you want to preserve.. in the following trivial sample I used the row with the MINimal [id] between each group.. having that row identified, you can update it with the aggregated data from the temp table, and finally drop that temp table..

    SET NOCOUNT ON;
    USE tempdb;
    GO
    CREATE TABLE dbo.t (
     Id int NOT NULL,
     product_n int NOT NULL,
     QtyLocation1 int,
     Other int
     
     );
    GO
    INSERT INTO dbo.t VALUES (1, 1, 10, 9);
    INSERT INTO dbo.t VALUES (2, 2, 10, 8);
    INSERT INTO dbo.t VALUES (3, 2, 10, 7);
    INSERT INTO dbo.t VALUES (4, 3, 10, 6);
    INSERT INTO dbo.t VALUES (5, 3, 10, 5);
    INSERT INTO dbo.t VALUES (6, 3, 10, 4);
    INSERT INTO dbo.t VALUES (7, 4, 10, 3);
    GO
    PRINT 'these are the ''duplicated rows'', where [product_n] is duplicated';
    PRINT 'but not other parts of the attribute..';
    SELECT t.product_n, COUNT(*) AS [Count],
     SUM(t.QtyLocation1) AS [Sum]
     FROM dbo.t
     GROUP BY t.product_n
     HAVING COUNT(*) > 1;
    GO
    PRINT 'we are working in the sand castle of a transaction';
    PRINT 'in order to keep the whole work atomic..';
    BEGIN TRAN;
    PRINT 'populating a temp table with all your sums..';
    SELECT t.product_n, COUNT(*) AS [Count],
     SUM(t.QtyLocation1) AS [Sum]
     INTO #temp
     FROM dbo.t
     GROUP BY t.product_n
     HAVING COUNT(*) > 1;

    PRINT 'these are the rows that will be updated with the sums..';
    PRINT 'we are assuming that, for each product_n group, we will';
    PRINT 'update (with the SUM..) and preserve the row with the';
    PRINT 'MIN [id] for each product_n group';
    SELECT *
     FROM #temp
      JOIN dbo.t ON #temp.product_n = t.product_n
      WHERE  t.Id = (SELECT MIN(t2.Id)
           FROM dbo.t t2
           WHERE t2.product_n = t.product_n
       )

    PRINT 'updating..';
    UPDATE dbo.t
     SET QtyLocation1 = #temp.[Sum]
     FROM dbo.t t1
      JOIN #temp ON #temp.product_n = t1.product_n
      WHERE  t1.Id = (SELECT MIN(t2.Id)
           FROM dbo.t t2
           WHERE t2.product_n = t1.product_n
       );

    PRINT 'deleting ''duplicated rows''.. meaning that all the rows';
    PRINT 'with an [id] > than the MIN(id) for each product_n group';
    PRINT 'will be deleted..';
    DELETE
     FROM dbo.t
     WHERE Id IN
    (SELECT t1.Id
     FROM dbo.t t1
      JOIN #temp ON #temp.product_n = t1.product_n
     WHERE  t1.Id > (SELECT MIN(t2.Id)
           FROM dbo.t t2
           WHERE t2.product_n = t1.product_n
       )
     );

    PRINT 'and the result is..';
    SELECT *
     FROM dbo.t;
     
    PRINT 'dropping the temp table..';
    DROP TABLE #temp;
    COMMIT;
    GO
    DROP TABLE dbo.t;
    --<-----------------
    these are the 'duplicated rows', where [product_n] is duplicated
    but not other parts of the attribute..
    product_n   Count       Sum
    ----------- ----------- -----------
    2           2           20
    3           3           30

    we are working in the sand castle of a transaction
    in order to keep the whole work atomic..
    populating a temp table with all your sums..
    these are the rows that will be updated with the sums..
    we are assuming that, for each product_n group, we will
    update (with the SUM..) and preserve the row with the
    MIN [id] for each product_n group
    product_n   Count       Sum         Id          product_n   QtyLocation1 Other
    ----------- ----------- ----------- ----------- ----------- ------------ -----------
    2           2           20          2           2           10           8
    3           3           30          4           3           10           6

    updating..
    deleting 'duplicated rows'.. meaning that all the rows
    with an [id] > than the MIN(id) for each product_n group
    will be deleted..
    and the result is..
    Id          product_n   QtyLocation1 Other
    ----------- ----------- ------------ -----------
    1           1           10           9
    2           2           20           8
    4           3           30           6
    7           4           10           3

    dropping the temp table..

    please test your solution in a repro scenario :)

    regards


    http://www.asql.biz - DbaMgr2k - DbaMgr and further SQL Tools
    Tuesday, March 23, 2010 11:14 PM
  • hi

    thank you for replying I will try doing this.. will keep you posted !!

    Wednesday, March 24, 2010 11:09 AM