How to prevent Dimension tables from duplicate records?


  • Hi,

    As we know in a data warehouse there are some fact and dimension tables to store historical data. There is an ETL process to extract, transform and load data from heterogeneous data sources to a data warehouse. My problem in this process is about repeated records.

    Suppose that there is a dim table with data that is coming from other resources, the primary key of the dim table is an identity field. How should I prevent previously added records to not to be added again to the dim table?

    Suppose that I want to extract accounts and load them to the DimAccount. I created a task to do this in the SSIS. After first time of task execution 500,000 records would be added to the DimAccount. There are approximately 100 new accounts per day. I want to run the task every night to extract and load just 100 new accounts.

    How is it possible? Does SSIS have a solution for this problem?

    Thanks for attentions.


    Tuesday, June 05, 2012 2:35 PM

All replies

  • The usual way would be by using a lookup that joins on the natural key of the data in the DIM table. And then you would only insert on the nomatch output of the lookup.

    Chuck Pedretti | Magenic – North Region |

    Tuesday, June 05, 2012 2:42 PM
  • Excuse me Zhao

    but I think this is not the answer. He just mentioned, He didn't explain it .

    Monday, June 11, 2012 10:36 AM
  • I'd agree with Chuck,

    Make sure that you've built your dimension right.

    Create a Surrogate key (Identity 1,1), then add a column for your Natural Key (AccountID 100345 as e.g.), then do a lookup in SSIS against your AccountID in your Source, compared to your AccountID in your Dim Table, and only load the not matched, another good way is to take the not matched to a conditional split with a ISNULL(Lookup.AccountID), which basically says that if the record returns a NULL in the Dim Table, then it's new, so Load that one, else it would not be a new record. Best Practice in my opinion to cater for it all is to do a Hash check, and then cater for changes too. You should be covered on all aspects then. Hope this helps.

    Kind Regards

    Carel Greaves

    Monday, June 11, 2012 11:09 AM
  • You can use MERGE if you have SQL Server 2008 and upwards.


    Please vote as helpful or mark as answer, if it helps
    Cheers, Raunak | t: @raunakjhawar | My Blog

    Monday, June 11, 2012 11:12 AM
  • You can use MERGE if you have SQL Server 2008 and upwards.


    Please vote as helpful or mark as answer, if it helps
    Cheers, Raunak | t: @raunakjhawar | My Blog

    Assuming that you can access both your source and destination data in the same query.

    Chuck Pedretti | Magenic – North Region |

    Monday, June 11, 2012 12:28 PM
  • you will have to create one more table calle table_load  that will have  2 more columns, one date time that will be get date() for you to know when was the last update. and other flag datatype =bit

    in execute sql task preferably run a procedure that has merge in it where you will pull in all the records ,you will set the flag to 1 in this table_load

    in your ssis you will have execute sql task calling all records where

    flag =1

    once you get all 500,000 records you will send them to your main table, you can get rid of those 2 extra columns if you want to now

    In execute sql you will  change the flag to 0 in table_load ,


    next time when you will run your package it will first do your merge procedure and get the records that do not match on ID (which ever is your ID columns)  you will set the flag to 1, 

    then in execute sql you will  again call all records that have flag =1

    this time you will get only records that are new as they have flag=1

    once loaded in the end you will change the flag to 0 back

    Miss Never Giveup

    Monday, June 11, 2012 1:36 PM