How to prevent duplicate keys in an archive database?

  • Question

  • I am struggling with this problem.

    Background: I'm working on a project where I have to build an archive database. The archive database should receive all data from the operational database. It should even save every update ever made, so it literally contains the entire history of the operational database (this is a must and the whole project revolves around this idea). This is solved by using Change Data Capture. After that the data should go through a staging area and eventually into the data warehouse database. I came up with a solution, worked it out in a prototype, and it seemed to be working fine. But I stupidly forgot to include the foreign keys, so the archive database didn't have the original structure, which it of course should (no wonder it went okay without too much hassle).

    Problem: Because we want to store everything in the archive, there will be duplicate primary keys (for instance, many rows with the same contact_id because a telephone number changes a couple of times). I thought I'd solve this by adding a new primary key that is auto-increment and exists purely to make each record unique. But when it comes to foreign keys, that breaks down: you want contact_id to be allowed to repeat, and in that case it cannot be a primary key. A foreign key, however, can only reference a primary key or another unique key, not ordinary columns.

    Any advice on this? It's an absolute must to store all changes.

    Wednesday, August 20, 2014 1:28 PM

All replies

  • My suggestion would be to store the history in separate history tables that can easily be maintained via triggers. This way you store the latest version of the records in the original tables (keeping the original schema) and any previous changes in the corresponding history table.

    Would that work for you?
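
    For illustration, a minimal sketch of the trigger-based history approach (the table and column names here are made up for the example, not taken from the actual project):

        -- Hypothetical operational table
        CREATE TABLE dbo.Contact (
            contact_id INT           NOT NULL PRIMARY KEY,
            name       NVARCHAR(100) NOT NULL,
            phone      VARCHAR(20)   NULL
        );

        -- History table: same columns plus some audit metadata
        CREATE TABLE dbo.Contact_History (
            history_id INT IDENTITY(1,1) PRIMARY KEY,
            contact_id INT           NOT NULL,
            name       NVARCHAR(100) NOT NULL,
            phone      VARCHAR(20)   NULL,
            changed_at DATETIME2     NOT NULL DEFAULT SYSDATETIME(),
            operation  CHAR(1)       NOT NULL  -- 'U' = update, 'D' = delete
        );
        GO

        -- Copy the old version of each row into the history table on every update or delete
        CREATE TRIGGER dbo.trg_Contact_History
        ON dbo.Contact
        AFTER UPDATE, DELETE
        AS
        BEGIN
            SET NOCOUNT ON;
            INSERT INTO dbo.Contact_History (contact_id, name, phone, operation)
            SELECT d.contact_id, d.name, d.phone,
                   CASE WHEN EXISTS (SELECT 1 FROM inserted) THEN 'U' ELSE 'D' END
            FROM deleted AS d;
        END;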


    Bodo Michael Danitz - MCT, MCITP - free consultant - performance specialist - www.sql-server.de

    Wednesday, August 20, 2014 2:27 PM
  • This is a typical Type 2 scenario for any dimension. You keep only surrogate keys (the unique, auto-increment field in your case). All references are made on this surrogate key, which you set as the primary key in the main table and as the foreign key in the referencing tables, so each reference always points to one and only one record. You also maintain two date fields in every dimension to indicate the period for which a particular entry is valid. Whenever you populate child tables, the references are resolved by joining not only on the id fields but also on the time period of the entry, so the lookup always matches a single entry, and that is the referential key you populate in the child table.
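
    A minimal sketch of what that could look like (the table and column names below are assumptions for illustration, not from the original databases):

        -- Contact archive as a Type 2 style table: the surrogate key is the primary key,
        -- contact_id is the business key and may repeat across versions
        CREATE TABLE dbo.Contact_Archive (
            contact_sk  INT IDENTITY(1,1) PRIMARY KEY,   -- surrogate key
            contact_id  INT           NOT NULL,          -- business key from the source system
            name        NVARCHAR(100) NOT NULL,
            valid_from  DATETIME2     NOT NULL,
            valid_to    DATETIME2     NULL               -- NULL = current version
        );

        -- A child table references the surrogate key, so every reference
        -- points to exactly one version of the contact
        CREATE TABLE dbo.Organisation_Contact_Archive (
            org_contact_sk INT IDENTITY(1,1) PRIMARY KEY,
            org_id         INT NOT NULL,
            contact_sk     INT NOT NULL
                REFERENCES dbo.Contact_Archive (contact_sk)
        );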

    More details here

    http://www.folkstalk.com/2012/11/how-to-load-rows-into-fact-table-in.html


    Please Mark This As Answer if it helps to solve the issue Visakh ---------------------------- http://visakhm.blogspot.com/ https://www.facebook.com/VmBlogs

    Wednesday, August 20, 2014 2:34 PM
  • Yeah, agreed. Take a look at Kimball's Type 2 dimension. Wikipedia has a good example: http://en.wikipedia.org/wiki/Slowly_changing_dimension#Type_2. Another type of solution is called a temporal database http://en.wikipedia.org/wiki/Temporal_database. Wikipedia has serviceable examples (I didn't read the entire article to answer you, but I skimmed it and the pictures are what I expect :)

    Basically, your key becomes the original key + an effective time range. In data warehouse Type 2, they then implement a surrogate key to stand in for that value, and as you load the tables, you choose the time range and use the new dimension surrogate value for references. Very easy to use, but a wee bit fragile, so you need to follow the given patterns for loading/modifying. (Just like Visakh's link says too.)

    For temporal, you could use a surrogate or natural key, but you just keep versions of data around. So when you select from a table, you always pass in a time frame, and you see past versions of data (based on every row's begin and end times). Even deleted rows would still be in the table. If you want the current row, it would be the one whose end date lies past the current date (and there should only ever be one).
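
    A rough sketch of that kind of point-in-time query, assuming versioned rows with valid_from/valid_to columns (the names are made up for the example):

        -- Return the version of each contact that was valid at @AsOf;
        -- a NULL valid_to marks the current version
        DECLARE @AsOf DATETIME2 = '2014-05-01';

        SELECT c.contact_id, c.name, c.valid_from, c.valid_to
        FROM dbo.Contact_Archive AS c
        WHERE c.valid_from <= @AsOf
          AND (c.valid_to IS NULL OR c.valid_to > @AsOf);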

    I would go for the dimensional design model myself (and I have :)


    Louis

    Without good requirements, my advice is only guesses. Please don't hold it against me if my answer answers my interpretation of your questions.

    Wednesday, August 20, 2014 4:50 PM
  • All of you, thanks for replying, I'm happy you're trying to help me out with this problem. 

    Visakh and Louis, thanks, that does seem like the solution for this case. Yes, the dimensional design appeals more to me as well.

    I read the articles and watched some tutorials, but I can't fit it into the solution that I had.
    More background info: I use CDC to track all the changes made in the operational database, and SSIS (following one of Matt Mason's tutorials, with a lot of alterations to make it fit my project). I have this control flow (don't mind that error haha):

    (Oh apparently I cannot add images yet, so here's the link for the screenshot:) http://nl.tinypic.com/r/w0p1u0/8

    Basically I create staging tables in my archive database next to my normal archive tables. The CDC control task then starts to get the processing range, and everything from the operational database (joined with a few CDC columns) is copied to the staging tables. After that the processing range ends, so next time it will only get the rows it hasn't processed before. Then I do some updates on the staging tables and finally insert everything into the archive tables. The staging tables can then be truncated. After this the data goes to the staging area for transformations and then finally to the DWH. The reason for having a staging area between the archive and the DWH is that the archive will not only be used as a source for the DWH but also on its own. The DWH will not contain 100% the same stuff as the archive (there may be some transformations, extra columns with calculated fields, plus some columns don't need to be in the DWH at all). When all the ETL work is done in SSIS, I have to use SSAS to define all the facts, dimensions and cubes.

    Example: So I try to work with SCD Type 2. If I understood it correctly (and maybe I didn't): for example, the contact table in the archive should have the surrogate key ID (the auto-increment one). The business key is contact_id, which is unique in combination with the time range columns.
    Following Visakh's post, the ID becomes the key that the foreign key will reference. For example:

    Contact table:

    ID  contact_id  name    start_time  end_time
    1   100         Glenn   2014-01-01  2014-08-20
    2   100         Danzig  2014-08-20  NULL

    So the employee changed his name. It makes sense that the time period tells when the first name was valid.

    Organisation table:

    ID  org_id  contact_id  start_time  end_time
    1   20      1           2014-01-01  NULL

    (it references ID instead of contact_id, as suggested)

    The employee belongs to an organisation. It references contact ID 1, which is by now old data, yet this is the latest version of the organisation record.

    So then I need a table to link the 2:

    organisation_contact table:

    contact_id  org_id
    100         20

    and then I need another one to join with the surrogate key?

    ID  org_id
    1   20
    2   20

    (Guess it would make more sense to have org_id in the contact table but for now it's an example)

    Problems: I don't quite understand how this works. From the example it looks like you have to have another table (the fact table) to link to the surrogate key. Would this mean I have to have fact and dimension tables in my archive database? My intention was actually to have all records of the operational databases (all the updates too) in my archive, and only after that create the facts and dimensions in the DWH with SSAS. The example looks like I should do it earlier.

    I don't know how to combine this with the CDC solution. I want to get all the data by using CDC, so that every update gets registered in the accompanying CDC table and the archive then receives the CDC data. But how do I combine that with SCD? I have the surrogate key (ID) in the archive and then I make the start and end time columns. I need to point all references to the ID, then make the other table to keep track of contact_id (the original PK) and another key, and at last make another table to track all the current data for the fact.
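
    For what it's worth, a hedged sketch of how the surrogate key lookup could work while loading a child table: join on the business key and pick the version whose validity range covers the CDC event time of the child row (the staging table and column names here are assumptions):

        INSERT INTO dbo.Organisation_Contact_Archive (org_id, contact_sk)
        SELECT s.org_id, c.contact_sk
        FROM dbo.Staging_Organisation_Contact AS s
        JOIN dbo.Contact_Archive AS c
          ON  c.contact_id = s.contact_id
          AND c.valid_from <= s.cdc_event_time
          AND (c.valid_to IS NULL OR c.valid_to > s.cdc_event_time);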

    Another question: would you recommend the SCD task in SSIS? I read it is not that great if you have many rows to work with. What do you think is the best method to implement it?

    Thanks so much again.


    EDIT: What about slowly changing dimension Type 4? It looks like you don't have to change the foreign key references then. Why do you prefer Type 2 over Type 4?
    • Edited by KimLee23 Friday, August 22, 2014 7:36 AM
    Wednesday, August 20, 2014 10:29 PM
  • Any thoughts for Kim's new questions?

    Thanks!


    Ed Price, Azure & Power BI Customer Program Manager (Blog, Small Basic, Wiki Ninjas, Wiki)

    Answer an interesting question? Create a wiki article about it!

    Thursday, November 20, 2014 12:36 AM
  • I have never used the built-in SCD SSIS task.

    What I have used is the http://dimensionmergescd.codeplex.com/ 3rd-party add-in, but it had memory issues (both the free version and the supported Pragmatic Works version) for larger datasets the last time I looked at it.

    I think you'll need to implement the logic yourself anyway, because your from/to dates aren't the execution dates of your ETL process but the dates of your CDC events.

    Also consider using this approach instead of the usual from/to effective dates: http://sqlblog.com/blogs/jamie_thomson/archive/2009/11/28/debunking-kimball-effective-dates.aspx

    In your scenario, only managing one date would result in much simpler and easier to manage code at the cost of query-time performance. Up to you to decide whether that's a fair tradeoff.
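
    As a rough illustration of the single-date idea (my sketch, not Jamie Thomson's exact code; it assumes each version row stores only its effective-from date, and LEAD needs SQL Server 2012 or later):

        -- The effective-to date is derived at query time from the next version
        -- of the same business key; NULL means the row is the current version
        SELECT c.contact_id,
               c.name,
               c.valid_from,
               LEAD(c.valid_from) OVER (PARTITION BY c.contact_id
                                        ORDER BY c.valid_from) AS valid_to
        FROM dbo.Contact_Archive AS c;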

    The way I usually do it is to use the checksum task, a merge join and a conditional split. It's much faster than the built-in SCD, and rows that cannot satisfy a match are disposed of, whereas the Dimension Merge SCD task keeps them in the buffer and blows out your memory.

    checksum task is here: http://www.sqlis.com/post/Checksum-Transformation.aspx



    The above is an example of a Type 1 SCD implementation using checksum + merge join + conditional split.

    What I would try in your situation is to set up every table to use the natural key + CDC date as the primary key and have the ETLs perform inserts only (Type 0? - existing rows never change). All joins between the tables would then need to be on the business key, plus a check of the CDC date in your main table against the from/to date range in the table you're joining (you could create views that transpose the surrogate keys from the various tables into each other to simplify querying for end users). I'd take the performance hit for the accuracy and simplicity, versus trying to maintain a single-column surrogate key between the various archive tables in the ETL and enabling foreign key enforcement at the data layer...
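
    A rough sketch of that layout with made-up names (natural key + CDC date as the primary key, inserts only; the _Archive2 tables are hypothetical and separate from the earlier surrogate-key example):

        -- Archive table keyed on the business key plus the CDC event date; rows are only ever inserted
        CREATE TABLE dbo.Contact_Archive2 (
            contact_id INT           NOT NULL,
            cdc_date   DATETIME2     NOT NULL,
            name       NVARCHAR(100) NOT NULL,
            CONSTRAINT PK_Contact_Archive2 PRIMARY KEY (contact_id, cdc_date)
        );

        -- Join a child row to the contact version that was current at the child's CDC date
        SELECT o.org_id, o.contact_id, o.cdc_date, cur.name
        FROM dbo.Organisation_Archive2 AS o
        CROSS APPLY (
            SELECT TOP (1) ca.name
            FROM dbo.Contact_Archive2 AS ca
            WHERE ca.contact_id = o.contact_id
              AND ca.cdc_date  <= o.cdc_date
            ORDER BY ca.cdc_date DESC
        ) AS cur;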

    Depending on your use case - e.g. if users only care about the current version of the data - you could, as suggested above, have two sets of tables for each source table: the current version of the data (traditional Type 1) and an archive table that holds the historical data. You could create a UNION ALL view over the top so you can query the full dataset when required.
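
    For example, assuming a hypothetical current table and archive table with matching columns:

        -- View exposing current and historical contact rows together
        CREATE VIEW dbo.Contact_All
        AS
        SELECT contact_id, name FROM dbo.Contact_Current
        UNION ALL
        SELECT contact_id, name FROM dbo.Contact_Archive2;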


    A philosophical debate on some of the wording of the above replies: SCDs aren't an auditing tool - they're there so you can rerun a past select and get the same resultset when you use the previous run date as a filter. What you're doing is using a date value from the source system as part of your composite business key, so I'd call what you have a Type 0 dimension that only does inserts (since 'updates' will appear as new rows using the natural key + CDC event date as the primary key).

    E.g. say your SCD source changed multiple times since your last refresh. Running a Type 2 ETL would only apply the latest change and set the effective date for the new row to the execution date of the ETL (not a date pulled from the source system). The intermediate changes that happened in the source system were never queryable from the data warehouse, so they have no place in the data warehouse (in a typical Type 2 SCD scenario).

    But anyway, that's just how I think of it.


    Jakub @ Adelaide, Australia Blog

    • Marked as answer by Kalman Toth Thursday, November 20, 2014 2:34 PM
    Thursday, November 20, 2014 2:33 AM
  • This is how I implemented SCD logic using hashing:

    http://visakhm.blogspot.in/2014/06/ssis-tips-implementing-scd.html
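
    As an illustration of the general idea (not necessarily the exact approach in the blog post; the table names are hypothetical), changed rows can be detected by comparing a hash of the tracked columns:

        -- Rows that are new or whose tracked columns differ from the current archive version;
        -- in practice you would concatenate all tracked columns into the hash
        SELECT s.contact_id, s.name
        FROM dbo.Staging_Contact AS s
        LEFT JOIN dbo.Contact_Archive AS c
          ON  c.contact_id = s.contact_id
          AND c.valid_to IS NULL                 -- current version only
        WHERE c.contact_id IS NULL               -- brand-new business key
           OR HASHBYTES('SHA2_256', s.name) <> HASHBYTES('SHA2_256', c.name);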


    Please Mark This As Answer if it solved your issue
    Please Mark This As Helpful if it helps to solve your issue
    Visakh
    ----------------------------
    My MSDN Page
    My Personal Blog
    My Facebook Page

    Thursday, November 20, 2014 6:21 PM