locked
Need suggestions regarding a system design for incremental load RRS feed

  • Question

  • Hi,

    Our client has  a set of SQL Server tables which are being fully refreshed daily using views from DB2 source tables. We need to design an approach to load them incrementally.

    We have a third party application 'XXX' which will provide us the changed records in the underlying DB2 tables into SQL Server tables daily. Let us call them CDC_<tbl>. This table will have the same schema of source and have a flag to indicate whether it is an insert, update or delete record. From these CDC tables, we have to do the required transformations and do the insert/update/delete from the target table accordingly.

    This approach would work easily for cases where there is only one source table. But when we have multiple tables joined together to load the target table, we are unable to design an approach. If on a particular day, an insert record comes in only one of the source CDC tables, we will not be getting that row in the target since the other CDC table doesn't have a record for that particular key. We cannot join the SQL Server CDC table with source DB2 table since that will cause performance issues.

    Please share your thoughts on how we can design an approach which will work in cases of join, union, group by etc. between source tables. We are open to suggestions on changes in CDC tables also since the third party tool is to be configured as per our design needs.

    Thanks in advance,

    KP


    Wednesday, January 1, 2014 11:23 AM

Answers

  • If on a particular day, an insert record comes in only one of the source CDC tables, we will not be getting that row in the target since the other CDC table doesn't have a record for that particular key.

    If I understand correctly, you extract data using DB2 views (some with joins) and then use third party CDC software to capture all changes made to the underlying DB2 tables.  These changes are then applied to a transformed version of data extracted from the views.  If my understanding is correct, why use DB2 views at all?  It seems to me the transformation process must have intimate knowledge of the underlying DB2 tables anyway in order to apply the CDC data properly.

    Are you saying you are not getting the other CDC table row at all, or is it just that the CDC tables are not transactionally consistent at the time the changes are applied to the target?  I think transactional inconsistencies should be addressed by the third party CDC application. The other alternative as I see it is to query DB2 for a each CDC key to make sure you have the latest data during the transformation process.


    Dan Guzman, SQL Server MVP, http://www.dbdelta.com

    • Proposed as answer by Heidi-Duan Thursday, January 2, 2014 6:09 AM
    • Marked as answer by Elvis Long Tuesday, January 21, 2014 6:11 AM
    Wednesday, January 1, 2014 3:53 PM
    Answerer
  • "This approach would work easily for cases where there is only one source table. But when we have multiple tables joined together to load the target table, we are unable to design an approach. "

    Here's a viable approach.  Perform a full load of the target table each time using TSQL MERGE with the view as the source.  Use the matching in the MERGE to determine whether to INSERT, UPDATE or DELETE rather than the flags in your CDC tables.

    Or don't do that.  Maintain multiple target tables, and after you load the target tables put a view in front of them.

    David


    David http://blogs.msdn.com/b/dbrowne/

    • Proposed as answer by Heidi-Duan Thursday, January 2, 2014 6:09 AM
    • Marked as answer by Elvis Long Tuesday, January 21, 2014 6:11 AM
    Wednesday, January 1, 2014 7:15 PM

All replies

  • If on a particular day, an insert record comes in only one of the source CDC tables, we will not be getting that row in the target since the other CDC table doesn't have a record for that particular key.

    If I understand correctly, you extract data using DB2 views (some with joins) and then use third party CDC software to capture all changes made to the underlying DB2 tables.  These changes are then applied to a transformed version of data extracted from the views.  If my understanding is correct, why use DB2 views at all?  It seems to me the transformation process must have intimate knowledge of the underlying DB2 tables anyway in order to apply the CDC data properly.

    Are you saying you are not getting the other CDC table row at all, or is it just that the CDC tables are not transactionally consistent at the time the changes are applied to the target?  I think transactional inconsistencies should be addressed by the third party CDC application. The other alternative as I see it is to query DB2 for a each CDC key to make sure you have the latest data during the transformation process.


    Dan Guzman, SQL Server MVP, http://www.dbdelta.com

    • Proposed as answer by Heidi-Duan Thursday, January 2, 2014 6:09 AM
    • Marked as answer by Elvis Long Tuesday, January 21, 2014 6:11 AM
    Wednesday, January 1, 2014 3:53 PM
    Answerer
  • "This approach would work easily for cases where there is only one source table. But when we have multiple tables joined together to load the target table, we are unable to design an approach. "

    Here's a viable approach.  Perform a full load of the target table each time using TSQL MERGE with the view as the source.  Use the matching in the MERGE to determine whether to INSERT, UPDATE or DELETE rather than the flags in your CDC tables.

    Or don't do that.  Maintain multiple target tables, and after you load the target tables put a view in front of them.

    David


    David http://blogs.msdn.com/b/dbrowne/

    • Proposed as answer by Heidi-Duan Thursday, January 2, 2014 6:09 AM
    • Marked as answer by Elvis Long Tuesday, January 21, 2014 6:11 AM
    Wednesday, January 1, 2014 7:15 PM