Incremental load in Azure Data Lake with Large file size

    Question

  • Hi,

    I'm designing Data Factory pipelines to load data from Azure SQL DB to Azure Data Lake.

    My initial load/POC used a small subset of data, and I was able to load it from the SQL tables to Azure DL.

    Now there is a huge volume of tables (some with a billion+ rows) that I want to load from SQL DB to Azure DL using ADF.

    MS docs mentioned two options, i.e. watermark columns and change tracking.

    Let's say I have a "cust_transaction" table with millions of rows; when I load it to the DL, it lands as a single "cust_transaction.txt" file.

    Questions.

    1) What would be an optimal design to incrementally load the source data from SQL DB into that file in the data lake?

    2) How do I split or partition the files into smaller files?

    3) How should I merge and load the deltas from source data into the files?

    Thanks.


    M. Chowdhury

    Monday, November 12, 2018 7:14 PM

All replies

  • If you want to partition both the current table data and data that arrives later, a tumbling window trigger will be a good fit for your pipeline: the data in your big table is partitioned automatically into sequential time windows, and the data for each window is copied to a separate file in the data lake. You can take a look at the document below.

    https://docs.microsoft.com/en-us/azure/data-factory/how-to-create-tumbling-window-trigger
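
    For example, a tumbling window trigger definition could look roughly like the sketch below (the trigger, pipeline, and parameter names are only placeholders, not from your setup). The trigger passes each window's start and end time into pipeline parameters, and maxConcurrency controls how many windows run in parallel:

    {
        "name": "HourlyTumblingTrigger",
        "properties": {
            "type": "TumblingWindowTrigger",
            "typeProperties": {
                "frequency": "Hour",
                "interval": 1,
                "startTime": "2018-11-01T00:00:00Z",
                "maxConcurrency": 24
            },
            "pipeline": {
                "pipelineReference": {
                    "referenceName": "CopyCustTransactionIncremental",
                    "type": "PipelineReference"
                },
                "parameters": {
                    "windowStart": "@trigger().outputs.windowStartTime",
                    "windowEnd": "@trigger().outputs.windowEndTime"
                }
            }
        }
    }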


    Tuesday, November 13, 2018 1:03 PM
  • More examples on this topic: https://docs.microsoft.com/en-us/azure/data-factory/how-to-read-write-partitioned-data#use-a-pipeline-parameter
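
    For illustration, the sink dataset can be parameterized so that each window lands in its own folder. A rough sketch (dataset, linked service, and folder names are placeholders) might look like this:

    {
        "name": "CustTransactionSink",
        "properties": {
            "type": "AzureDataLakeStoreFile",
            "linkedServiceName": {
                "referenceName": "AzureDataLakeStoreLinkedService",
                "type": "LinkedServiceReference"
            },
            "parameters": {
                "windowStart": { "type": "String" }
            },
            "typeProperties": {
                "folderPath": {
                    "value": "@concat('cust_transaction/', formatDateTime(dataset().windowStart, 'yyyy/MM/dd/HH'))",
                    "type": "Expression"
                },
                "fileName": "cust_transaction.txt",
                "format": { "type": "TextFormat", "columnDelimiter": "," }
            }
        }
    }

    With an hourly trigger, each run then writes its slice to a folder like cust_transaction/2018/11/01/05, which also covers the question about splitting the big table into smaller files.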

    Tuesday, November 13, 2018 1:22 PM
  • Hi Nicolas,

    Thanks for your reply. I looked at the tumbling window trigger option within Data Factory, but I'm not completely sure how it will address my requirements automatically, i.e.

    1) Load a very large table from Azure SQL DB to Azure DL storage,

    2) Load deltas from Azure SQL DB to DL storage, and

    3) Partition tables and files in Azure SQL DB and DL.

    I would appreciate your response.


    M. Chowdhury

    Tuesday, November 13, 2018 5:35 PM
  • With a tumbling window, you can read data from Azure SQL DB with a query like the one below, which by nature partitions your data; each batch is written to a separate file in the DL.

    select * from MyTransactionTable
    where CreatedDate >= '@{formatDateTime(pipeline().parameters.windowStart, 'yyyy-MM-dd HH:mm' )}'
    AND CreatedDate < '@{formatDateTime(pipeline().parameters.windowEnd, 'yyyy-MM-dd HH:mm' )}'

    And if a tumbling window is used, the ADF pipeline will trigger concurrent activity runs at the same time - e.g. if your tumbling window is scheduled hourly starting from 11/01/2018 and you set the concurrent activity run number to 24, there will be 24 activity runs copying data for the different time windows of 11/01/2018, and 24 files will be written to the DL at the same time.

    And since the data is queried from Azure SQL DB in a series of time windows, previously copied data will not be read and copied again.
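
    To make this concrete, the pipeline invoked by the trigger could look roughly like the sketch below (activity and dataset names are placeholders, and the sink dataset is assumed to be parameterized per window as in the earlier sketch). The trigger's windowStartTime/windowEndTime outputs are mapped to the windowStart/windowEnd pipeline parameters used by the query:

    {
        "name": "CopyCustTransactionIncremental",
        "properties": {
            "parameters": {
                "windowStart": { "type": "String" },
                "windowEnd": { "type": "String" }
            },
            "activities": [
                {
                    "name": "CopyWindowToLake",
                    "type": "Copy",
                    "inputs": [ { "referenceName": "AzureSqlSourceDataset", "type": "DatasetReference" } ],
                    "outputs": [
                        {
                            "referenceName": "CustTransactionSink",
                            "type": "DatasetReference",
                            "parameters": { "windowStart": "@pipeline().parameters.windowStart" }
                        }
                    ],
                    "typeProperties": {
                        "source": {
                            "type": "SqlSource",
                            "sqlReaderQuery": "select * from MyTransactionTable where CreatedDate >= '@{formatDateTime(pipeline().parameters.windowStart, 'yyyy-MM-dd HH:mm')}' AND CreatedDate < '@{formatDateTime(pipeline().parameters.windowEnd, 'yyyy-MM-dd HH:mm')}'"
                        },
                        "sink": { "type": "AzureDataLakeStoreSink" }
                    }
                }
            ]
        }
    }

    Each run then writes only its own window's rows to a separate file in the lake.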


    Wednesday, November 14, 2018 1:42 AM
  • Thanks. So will the output in the data lake have multiple versions of files per table, or will the deltas be appended to the same file?

    For huge tables, it is logical to have partitioned files...

    The final output is for some users to access the raw data dump in the DL via Power BI. In that case, will each entity within the DL be presented to users as one file for viewing? If there will be multiple files (due to partitions or versions), how will they be consolidated for viewing?


    M. Chowdhury

    Wednesday, November 14, 2018 3:41 AM
  • The output in the data lake will have multiple versions of files per table.

    For merging multiple files in PBI, I'm not a PBI expert, but here are some documents I found; I hope they help.

    https://docs.microsoft.com/en-us/power-bi/desktop-combine-binaries

    https://community.powerbi.com/t5/Desktop/Read-Multi-CSV-files-and-generate-PBI-reports/td-p/108233

    Wednesday, November 14, 2018 1:18 PM
  • One final question regarding the tumbling window: for the incremental load, how will it handle updates to existing data? It looks like it will take care of new rows to insert using the window, but I'm not sure how it handles changes to existing data...

    M. Chowdhury

    Wednesday, November 14, 2018 2:59 PM
  • It depends on which time column you use in the SQL query. For example, suppose some records were copied to files in the DL several days ago but have just now been updated: if you use a column like "updateTime" to select data, those newly updated records will fall into a new tumbling window and thus be copied to a new file; but if you use a column like "createTime" in the SQL query, those updated records will never be copied to a new file again.
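
    For illustration only, if the table had a column such as LastModifiedDate that is set on every insert and update (that column name is just an assumption, not from your schema), the copy source in the earlier sketch would simply switch to that column:

    "source": {
        "type": "SqlSource",
        "sqlReaderQuery": "select * from cust_transaction where LastModifiedDate >= '@{formatDateTime(pipeline().parameters.windowStart, 'yyyy-MM-dd HH:mm')}' AND LastModifiedDate < '@{formatDateTime(pipeline().parameters.windowEnd, 'yyyy-MM-dd HH:mm')}'"
    }

    Updated rows would then reappear in a later window's file, so whatever consumes the files (Power BI or a downstream process) would need to keep only the latest version of each row when merging.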
    Thursday, November 15, 2018 1:54 AM