Moving 100 GB of data on a daily basis

    Question

  • I would like to use an Azure Data Factory pipeline to move 100 GB of data from Azure Data Lake to Azure Table storage.

    At the moment I have a single 100 GB file in ADL that has a few columns:

    Partition (string with 2 characters)

    Key (string)

    Date (date)

    Column1 (integer)

    Column2 (integer)

    I've tried running the pipeline, but it was much slower than expected (if I let it continue, it would run for several days). Please help me understand how to investigate ADF performance and which parameters I should try changing.

    Useful information:

    - All Azure services are located in the same data center.

    - Data in ADL is sorted by PartitionKey.

    - I've set "Degree of copy parallelism" on the copy activity to 64.

    - I've tried increasing "Data integration unit" on the copy activity as well, but it never uses more than 4 DIUs when the pipeline is run.


    Tatyana Yakushev [tableau.com]

    Wednesday, October 24, 2018 6:08 PM

All replies

  • Hello Tatyana,

    Do you know what 'throughput' and 'usedParallelCopies' values your activity is achieving? There is more information on copy activity monitoring here, if you haven't seen it:

    https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-overview#monitoring
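
    As a concrete way to check, here is a minimal Python sketch that prints those metrics from a copy activity's output JSON. The field names follow the monitoring doc above; the values below are placeholders rather than numbers from your run, and the DIU field name can differ between service versions:

        import json

        # Placeholder copy-activity output, shaped like the monitoring JSON in
        # the doc linked above. The numbers are illustrative only.
        activity_output = json.loads("""
        {
            "dataRead": 107374182400,
            "dataWritten": 107374182400,
            "throughput": 1228.8,
            "copyDuration": 85000,
            "usedParallelCopies": 4,
            "usedDataIntegrationUnits": 4
        }
        """)

        # Print the metrics asked about; .get() keeps this working if a field
        # is missing from a particular run's output.
        for field in ("throughput", "usedParallelCopies",
                      "usedDataIntegrationUnits", "copyDuration", "dataRead"):
            print(f"{field}: {activity_output.get(field, 'n/a')}")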

    Wednesday, October 24, 2018 7:30 PM
    Moderator
  • Have you checked this performance tuning doc?

    Monday, October 29, 2018 8:30 AM
  • Hi Tatyana,

    • For your case, is it possible to split the single 100 GB file into multiple smaller files (e.g., 100 × 1 GB files)? In that case, the ADF Copy activity can use more than 4 DIUs, which should further improve performance.
    • In your source file, are the rows sorted by the Partition column? If so, you can specify the azureTablePartitionKeyName property on the AzureTableSink so that the copy uses that column as the partition key in the destination Azure Table. Writing across multiple partitions in Azure Table gives better performance (see the sketch after this list).
    • It would also help if you could share a sample runId from one of your triggered copy runs.
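
    To make the second point concrete, here is a minimal Python sketch of how the relevant copy activity typeProperties might look, printed as the JSON that ends up in the pipeline definition. The type and property names (AzureDataLakeStoreSource, AzureTableSink, azureTablePartitionKeyName, parallelCopies, dataIntegrationUnits) follow the ADF copy activity documentation, but the exact values and shape are assumptions for this thread, not a drop-in pipeline:

        import json

        # Sketch of the copy activity's typeProperties; only the settings
        # discussed in this thread are shown, the rest of the pipeline
        # (datasets, linked services, schedule) is omitted.
        copy_type_properties = {
            "source": {
                # Read the file(s) from Azure Data Lake Store.
                "type": "AzureDataLakeStoreSource"
            },
            "sink": {
                "type": "AzureTableSink",
                # Use the 2-character "Partition" column from the source file
                # as the Azure Table partition key so that writes are spread
                # across many table partitions.
                "azureTablePartitionKeyName": "Partition"
            },
            # "Degree of copy parallelism" in the portal UI.
            "parallelCopies": 64,
            # Upper bound on DIUs; with a single large source file the service
            # may still use fewer than this (4 in the run described above).
            "dataIntegrationUnits": 32
        }

        print(json.dumps(copy_type_properties, indent=2))
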
    Monday, October 29, 2018 9:00 AM
  • Yes, I've read the performance tuning doc and changed the parallel copies and max DIU parameters. Yes, the file is sorted by the partition column, and I've specified it in the AzureTableSink.

    It never uses more than 4 DIUs. I will ask my peer to generate several smaller files so that ADF can use more than 4 DIUs.
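
    In case it helps with the split, below is a rough Python sketch of one way to break the file up, assuming it is a delimited text file with no header whose first field is the 2-character Partition value and whose rows are already sorted by it (file paths and the CSV format are assumptions for the example):

        import csv
        import os

        SOURCE = "adl_export/data_100gb.csv"   # hypothetical path to the single large file
        OUT_DIR = "adl_export/parts"           # hypothetical output folder
        os.makedirs(OUT_DIR, exist_ok=True)

        current_partition = None
        out_file = None
        writer = None

        with open(SOURCE, newline="") as src:
            for row in csv.reader(src):
                partition = row[0]             # assumes Partition is the first column
                if partition != current_partition:
                    # Rows are sorted by Partition, so a new value means the
                    # previous file is complete and a new one can be started.
                    if out_file:
                        out_file.close()
                    current_partition = partition
                    out_file = open(os.path.join(OUT_DIR, f"part_{partition}.csv"),
                                    "w", newline="")
                    writer = csv.writer(out_file)
                writer.writerow(row)

        if out_file:
            out_file.close()

    This produces one file per Partition value; if some partitions are still very large, they could be further chunked by row count.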

    Here is one of my run IDs: 11efeff2-e568-4706-827a-67031e389e7f

    (Note that it failed after running for more than 23 hours. The error is "The request to 'Azure Data Lake Store' failed due to unexpected error: Unable to connect to the remote server". The Copy activity needs to be more resilient to network problems.)


    Tatyana Yakushev 


    Monday, October 29, 2018 4:28 PM
  • Thanks Tatyana for sharing the detailed info.

    Based on your current throughput, it would take around (or slightly more than) 1 day to finish the 100 GB data load. Please do use multiple smaller files; that should give several times better throughput, and you could load the data in a few hours.
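
    For reference, the back-of-the-envelope arithmetic behind that estimate, in Python. The ~1.2 MB/s figure is an assumed effective throughput chosen so that 100 GB takes roughly a day (consistent with the 23+ hour run above); it is not a number reported by ADF:

        # Rough load-time estimate: size / throughput.
        size_gb = 100
        throughput_mb_per_s = 1.2      # assumed effective throughput

        hours = size_gb * 1024 / throughput_mb_per_s / 3600
        print(f"Estimated load time: {hours:.1f} hours")    # ~23.7 hours, about a day

        # If smaller files and more DIUs gave, say, 8x the throughput:
        print(f"At 8x throughput: {hours / 8:.1f} hours")   # ~3 hours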

    For the failure you hit, I noticed that the copy had already retried multiple times but still failed. It looks like there was a network issue during that period; we will take a further look at it.

    Tuesday, October 30, 2018 1:18 AM