[ADF] : Write partitioned parquet files


All replies

  • Hi Naceur.BES,

    Thank you for your question.

    Could you please provide more details about your scenario?

     

    Wednesday, May 22, 2019 7:19 PM
    Moderator
  • Hi KranthiPakala-MSFT,

    Thank you for your reply.

    I want to load data from an on-premises SQL Server to Blob Storage with a Copy activity in ADF; the target file is Parquet and its size is 5 GB.

    The pipeline works well and writes a single Parquet file. Now I need to split this file into multiple Parquet files to optimize data loading with PolyBase and for other uses.

    With Spark we can partition the output into multiple files with this syntax:

    df.repartition(5).write.parquet("path")
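
    For example, a minimal PySpark sketch of what I mean, reading straight from SQL Server over JDBC and writing a fixed number of Parquet files (server, database, table, credentials and output path are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read the source table over JDBC (placeholder connection details)
    df = (spark.read.format("jdbc")
          .option("url", "jdbc:sqlserver://<server>:1433;databaseName=<db>")
          .option("dbtable", "<schema>.<table>")
          .option("user", "<user>")
          .option("password", "<password>")
          .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
          .load())

    # repartition(5) shuffles the rows round-robin into exactly 5 partitions,
    # so this write produces 5 roughly equal Parquet part files.
    df.repartition(5).write.mode("overwrite").parquet("<output path>")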

    Thursday, May 23, 2019 10:35 AM
  • Hi Naceur.BES,

    Thank you for the details. Here is what I repro'd: copying table data from an on-premises SQL Server to Azure Blob while writing partitioned Parquet files.

    For the repro I used a simple Customer Details table as the source (just an example).

    Step 1:

    • Create a Source Dataset with a linked service connected to the SQL Server table from which we want to read the data.
    • Create a Sink Dataset with a linked service connected to Azure Blob Storage to write the partitioned Parquet files. Below are the Sink Dataset properties I used for the repro.

    [Screenshot: Sink Dataset properties]

    Step 2:

    • Create a Lookup activity, which returns the unique PersonIDs from the source table. Based on the PersonID, loop through a ForEach activity and copy a separate Parquet file for each unique PersonID.

    • ForEach activity properties: set the Items field to the output of the Lookup activity.

    • Inside the ForEach activity, there is a Copy activity which pulls the records from the source table based on PersonID and writes them to Azure Blob.

    In the output, the files are split into multiple Parquet files based on the 'PersonID' from the source table.

    In this example, I split the files based on the unique 'PersonID'. You can customize this part based on your split requirement; a rough Python analogue of the pattern is sketched below.
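
    For comparison, the same Lookup -> ForEach -> Copy pattern can be sketched outside ADF in plain Python. The connection string, the dbo.CustomerDetails table name, and the output file names below are illustrative assumptions; only the PersonID column is taken from this example.

    import pandas as pd
    import pyodbc

    # "Lookup": fetch the distinct keys to split on (connection details are placeholders)
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=<server>;DATABASE=<db>;Trusted_Connection=yes"
    )
    person_ids = pd.read_sql("SELECT DISTINCT PersonID FROM dbo.CustomerDetails", conn)["PersonID"]

    # "ForEach" + "Copy": write one Parquet file per unique PersonID (requires pyarrow)
    for pid in person_ids:
        chunk = pd.read_sql(
            "SELECT * FROM dbo.CustomerDetails WHERE PersonID = ?", conn, params=[pid]
        )
        chunk.to_parquet(f"customer_{pid}.parquet", index=False)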

    Hope this helps...

    Friday, May 24, 2019 9:51 PM
    Moderator
  • Hi Naceur.BES,

    Just wanted to check - did your query get resolved using the above suggestion?

    If the above answer was helpful, please click “Mark as Answer” AND/or “Up-Vote”, as it might be beneficial to other community members reading this thread.

    Tuesday, May 28, 2019 5:20 PM
    Moderator
  • Hi KranthiPakala-MSFT,

    Thank you very much for this detailed answer. This example is very pertinent, but I need to partition the file in a round-robin way with a fixed number of partitions, because with your example we must choose a column with a fixed, low cardinality.

    Tuesday, May 28, 2019 5:52 PM
  • Hi Naceur.BES,

    This can be achieved using Data Flow (Preview) when the data source is in Azure. Data Flow doesn't support on-premises data sources, and only a few Azure data sources are currently supported. (Note: Azure Data Factory Mapping Data Flow is currently a public preview feature.)

    As a workaround, you can use a Copy activity to move the data from on-premises to either Blob Storage or Azure SQL to stage it, and then use Data Flow to write the partitioned data into your storage.

    Additional info:

    To learn more about how Data Flow transformations partition data, please refer to: Mapping Data Flow transformation optimize tab

    The Optimize tab in each Data Flow transformation has optional settings to configure the partitioning scheme, including round-robin partitioning with a fixed number of partitions, which matches your requirement.
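
    Since Mapping Data Flows execute on Spark clusters, the Optimize-tab partition options correspond roughly to familiar Spark operations. Below is a conceptual PySpark sketch of that mapping, not the Data Flow implementation itself; the paths and the PersonID column are placeholders for illustration.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Data staged in Blob storage by the Copy activity (placeholder path)
    df = spark.read.parquet("wasbs://staging@youraccount.blob.core.windows.net/customers")

    # "Round Robin" with a fixed partition count: rows are spread evenly with no key
    # column, and the write produces exactly 5 Parquet part files.
    df.repartition(5).write.mode("overwrite").parquet("out/round_robin")

    # "Hash" partitioning: still a fixed number of files, but rows with the same
    # PersonID always land in the same file.
    df.repartition(5, "PersonID").write.mode("overwrite").parquet("out/hash")

    # "Key" partitioning: one folder per distinct PersonID, similar to the earlier
    # Lookup + ForEach pipeline.
    df.write.partitionBy("PersonID").mode("overwrite").parquet("out/key")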

    Hope this helps....


    [If a post helps to resolve your issue, please click "Mark as Answer" and/or "Vote as helpful" on that post. By marking a post as Answered and/or Helpful, you help others find the answer faster.]

    Wednesday, June 19, 2019 7:43 PM
    Moderator
  • Hi KranthiPakala-MSFT,

    Thank you for the reply; this solution works very well for my need.

    Wednesday, June 19, 2019 9:15 PM