MongoDB to SQL Server Incremental Loading RRS feed

  • Question

  • Hi everyone,

    I have been working on a Data Factory pipeline in Visual Studio for a little while now and can't seem to find the right procedure for what I want. I am trying to incrementally load data from a MongoDB source into SQL Server; however, a third party provides the MongoDB data through a SQL interface with a BI connector.

    I have been using an SSIS solution with C# scripts to read from the BI connector (using an ODBC driver), iterating through each table to cross-reference the databases, check for structure changes or added tables, and generate the queries needed to update or add the corresponding table in SQL Server. I then load the newly added data into the server (based on 'codingsight.com/implementing-incremental-load-using-change-data-capture-sql-server/'). I have been able to bulk load the data by replacing everything; however, as the database grows this isn't a sustainable solution, since it already takes a long time.

    After a while I was able to generate the scripts to update/add tables, but the incremental loading is giving me trouble. I am a student working at a start-up; I had never touched Azure (or even databases) before this project, and I just wanted some insight into whether this is the best way to approach the problem, or whether the procedure is even possible. Thank you.

    Thursday, August 1, 2019 7:01 PM
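    The schema cross-referencing step described above can be sketched in a few lines. This is a hedged illustration, not the poster's actual C# script: `diff_schema`, the table name, and the column dictionaries are all hypothetical, and real code would map ODBC type metadata to SQL Server types rather than comparing raw type strings.

    ```python
    from typing import Dict, List

    def diff_schema(source_cols: Dict[str, str],
                    target_cols: Dict[str, str],
                    table: str) -> List[str]:
        """Generate T-SQL statements that bring the target table's
        columns in line with the source. Types are compared as plain
        strings for illustration only."""
        stmts = []
        for col, col_type in source_cols.items():
            if col not in target_cols:
                # Column exists in the source but not the target: add it.
                stmts.append(f"ALTER TABLE {table} ADD {col} {col_type}")
            elif target_cols[col] != col_type:
                # Column exists in both but the type changed: alter it.
                stmts.append(f"ALTER TABLE {table} ALTER COLUMN {col} {col_type}")
        return stmts

    # Example: the source gained a 'name' column the target doesn't have yet.
    stmts = diff_schema(
        {"id": "INT", "name": "NVARCHAR(50)"},
        {"id": "INT"},
        "dbo.Users",
    )
    # → ["ALTER TABLE dbo.Users ADD name NVARCHAR(50)"]
    ```

    In practice the two column dictionaries would be populated from the ODBC driver's metadata on the source side and from `INFORMATION_SCHEMA.COLUMNS` on the SQL Server side.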

All replies

  • Hello Perrinii, and thank you for your inquiry and well-written ask.  While it is possible to do this with Data Factory, I want to inform you of other options, including Azure Database Migration Service.  Since you are interacting with a SQL interface and targeting Azure SQL Server, it may apply to your situation.

    If you want to do this with Data Factory, there are a few more things I need to know.  Have you tried accessing the transaction logs?  Is there a 'last_modified_date'-like property you can query?  What trouble is the incremental loading giving you?

    Monday, August 5, 2019 9:26 PM
  • Thank you for the response. I actually haven't tried accessing the transaction logs. Using the reference link above, the problem I ran into was the creation of duplicate _CT tables describing the same source table.

    After some quick querying I found a 'last_modified_date' column for each row; it seems I was doing a lot of overkill, as the last modified times were already easily available (in a format like '2019-02-06T22:08:57.469Z'). Still using Data Factory, do you think querying for and inserting rows past a certain date would be the best procedure?

    Tuesday, August 6, 2019 2:00 PM
  • Yes.  That would work, as long as you are certain it includes deletions.  If a record no longer exists, then it cannot have a last modified date, right?  There is a tutorial in the ADF docs about using 'watermarks' for incremental loads.  I think that will work well with the last_modified_date.
    Tuesday, August 6, 2019 6:13 PM
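    The watermark idea above can be sketched as follows. This is a minimal illustration, not the ADF tutorial's implementation: `incremental_batch`, `parse_ts`, and the row layout are hypothetical, and it assumes the source timestamps look like the '2019-02-06T22:08:57.469Z' format quoted earlier.

    ```python
    from datetime import datetime
    from typing import Dict, List, Tuple

    ISO_FMT = "%Y-%m-%dT%H:%M:%S.%f%z"

    def parse_ts(value: str) -> datetime:
        # The source emits e.g. '2019-02-06T22:08:57.469Z'; replace the
        # trailing 'Z' so strptime's %z directive accepts the offset.
        return datetime.strptime(value.replace("Z", "+0000"), ISO_FMT)

    def incremental_batch(rows: List[Dict],
                          watermark: datetime) -> Tuple[List[Dict], datetime]:
        """Return only rows modified after the watermark, plus the new
        watermark (the max last_modified_date seen in this batch)."""
        fresh = [r for r in rows if parse_ts(r["last_modified_date"]) > watermark]
        if fresh:
            watermark = max(parse_ts(r["last_modified_date"]) for r in fresh)
        return fresh, watermark

    rows = [
        {"last_modified_date": "2019-02-06T22:08:57.469Z", "value": 1},
        {"last_modified_date": "2019-03-01T00:00:00.000Z", "value": 2},
    ]
    batch, new_wm = incremental_batch(rows, parse_ts("2019-02-10T00:00:00.000Z"))
    # batch contains only the row modified after the watermark (value 2)
    ```

    In ADF the watermark would be persisted in a control table and passed into the source query, rather than filtered in application code; and as noted above, this pattern alone will not pick up deletions.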
  • Okay, I will look into that. Thank you for your help!
    Tuesday, August 6, 2019 6:54 PM
  • Did my suggestion fix your issue, or do you need more help?  If it fixed the issue, please mark as answered.
    Wednesday, August 7, 2019 5:44 PM
  • @Perrinii, are you still following this thread?  After further consideration, I realized I didn't give you the best answer.  If you are still blocked, let me know.  
    Monday, August 12, 2019 7:02 PM