Split ADLS file into smaller chunks

  • Question

  • Hi there,

We have an ADF pipeline that converts a structured stream (source) to Parquet (sink/target). The source is in ADLS Gen1 and the target is ADLS Gen2. The source file is 6 TB. When I run a normal Copy activity it takes days. How can I make this pipeline more efficient? The Data Integration Unit setting is already at 256.

    Is there a way to quickly split the big file into a bunch of smaller blobs so that the copy activity (Parquet conversion) can happen in parallel?


    Tuesday, August 20, 2019 5:49 AM

All replies

  • Hi there,

    Unfortunately, there's no built-in way to split a file into small chunks and copy it. You can, however, use a Custom activity or an Azure Function activity to read the file from the source (using the Data Lake SDKs or REST APIs), split it, and then copy the chunks to the destination.

    Please note: the split operation will also consume considerable time, but it may still reduce the overall time compared to a single large copy.
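    As a rough sketch of the splitting step, here is a minimal local-file version. All names here are hypothetical; real code inside a Custom activity or Azure Function would read byte ranges from ADLS Gen1 via the Data Lake SDK or REST API instead of opening a local file:

    ```python
    import os

    def split_file(src_path, out_dir, chunk_size=256 * 1024 * 1024):
        """Split src_path into fixed-size chunks written to out_dir.

        Returns the list of chunk file paths. A real implementation
        would stream byte ranges from ADLS Gen1 via the SDK rather
        than reading a local file.
        """
        os.makedirs(out_dir, exist_ok=True)
        chunks = []
        with open(src_path, "rb") as src:
            index = 0
            while True:
                data = src.read(chunk_size)
                if not data:
                    break
                chunk_path = os.path.join(out_dir, f"part-{index:05d}.bin")
                with open(chunk_path, "wb") as dst:
                    dst.write(data)
                chunks.append(chunk_path)
                index += 1
        return chunks
    ```

    One caveat: a fixed byte-size split will cut records in half. For line-delimited or record-structured data you would need to split on record boundaries instead, so that each chunk is independently parseable for the Parquet conversion.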

    Tuesday, August 20, 2019 7:01 AM
  • Hi there,

    Just wanted to check - was the above suggestion helpful to you? If yes, please consider upvoting and/or marking it as the answer. This would help other community members reading this thread.

    Thursday, August 22, 2019 7:14 AM
  • You can use Data Flow, which can partition the data and write it.
    Tuesday, August 27, 2019 12:33 AM
  • Hi,

    You can also explore other options, e.g. Azure Databricks or a Spark cluster, which offer better performance for this kind of workload.


    Tuesday, August 27, 2019 1:44 AM
  • Data Flow is a metadata surface for authoring ETL pipelines, and it runs on Spark. If you are on ADF, your obvious choice is Data Flow.
    Tuesday, August 27, 2019 7:44 PM
  • Hi there,

    You can also use the Data Flow feature, as suggested above.

    This can be achieved using Data Flow (Preview) when the data source is in Azure. Data Flow doesn't support on-premises data sources, and only a few Azure data sources are supported currently. (Note: Azure Data Factory Mapping Data Flow is currently a public preview feature.)

    As a workaround, you can use a Copy activity to stage the data from on-premises into either Blob storage or Azure SQL, and then use Data Flow to write partitioned data into your storage.

    Additional info:

    To learn more about how Data Flow transformations partition data, please refer to: Mapping data flow transformation optimize tab

    The Optimize tab in a Data Flow transformation has optional settings to configure partitioning schemes for data flows.
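    To illustrate what a partitioning scheme does conceptually: one common scheme is hash partitioning, where rows are bucketed by a hash of a key column so each bucket can be written (and later copied) in parallel. A minimal plain-Python sketch of the idea follows; this is not Data Flow's actual implementation (Data Flow runs this distributed on Spark), and all names are hypothetical:

    ```python
    from collections import defaultdict

    def hash_partition(rows, key, num_partitions):
        """Bucket rows by hashing the given key column.

        Each resulting bucket could be written as its own file and
        processed in parallel, which is the effect a hash-partitioned
        write in Data Flow's Optimize tab achieves at scale.
        """
        partitions = defaultdict(list)
        for row in rows:
            bucket = hash(row[key]) % num_partitions
            partitions[bucket].append(row)
        return dict(partitions)
    ```

    With many roughly equal-sized buckets, the downstream Parquet conversion and copy can fan out across workers instead of processing one huge file serially.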

    Ref - https://social.msdn.microsoft.com/Forums/en-US/761fd153-bffe-405a-9383-ce356bf31ce0/adf-write-partitioned-parquet-files?forum=AzureDataFactory

    Hope this helps.

    Wednesday, August 28, 2019 9:24 AM
  • Hi there,

    Just wanted to check - did the above suggestion help you? If yes, please consider upvoting and/or marking it as the answer. This would help other users reading this thread.

    Monday, September 16, 2019 8:29 AM
  • Hi there,

    I haven't heard back from you in quite some time now. Was your query resolved with the above suggestion? If yes, please consider upvoting and/or marking it as answer. This would help other users reading this thread. 

    Wednesday, September 18, 2019 6:52 AM