Unzip File(s) and Copy Into Azure Data Lake in ADF

  • Question

  • Hello,

    I have a bunch of zip files in folders and sub-folders in a Blob container, and I plan to use the ADF Copy activity to unzip them to .csv and copy the files to Azure Data Lake. Example below.

    Source file structure (Blob Storage):

    Container/folder1/202001/file1.zip

    Container/folder1/202001/file2.zip

    Container/folder1/202001/file3.zip

    Container/folder2/201912/file1.zip

    Container/folder2/201912/file2.zip

    I want to unzip them and copy them into ADLS as below.

    Expected destination folder structure (ADLS):

    Container/folder1/202001/file1.csv

    Container/folder1/202001/file2.csv

    Container/folder1/202001/file3.csv

    Container/folder2/201912/file1.csv

    Container/folder2/201912/file2.csv

    But I am not able to do that with the Copy activity, because it automatically creates an extra sub-folder named after the .zip file and then places the .csv file inside it, which I don't want.

    Ex: Container/folder2/201912/file2.zip/file2.csv

    Also, I don't want to use a Get Metadata activity to list all the files and pass them through a ForEach activity that internally calls the Copy activity, because the Get Metadata activity has a 1 MB limit on the file-name list it can return, and we have more files to process than that.

    Microsoft describes it as follows:

    Copy zipped files from an on-premises file system, decompress them on-the-fly, and write extracted files to Azure Data Lake Storage Gen2

    https://docs.microsoft.com/en-us/azure/data-factory/supported-file-formats-and-compression-codecs#compression-support

    I really have no idea how this is done. Please suggest how to process the .zip files into .csv and copy the files into the same folder/sub-folder structure in ADLS using ADF.

    I'd appreciate your help on this.

    Thanks.

    Friday, January 24, 2020 5:06 AM

All replies

  • Hi there,

    Azure Data Factory itself supports decompressing data during copy. Specify the compression property on the input dataset, and the copy activity reads the compressed data from the source and decompresses it.

    There is also an option to specify the property on an output dataset, which makes the copy activity compress the data before writing it to the sink.

    For your use case, you need to read compressed data from an Azure blob, decompress it, and write the resulting data out, so define the input Azure Blob dataset with the compression type set to match your files (for example, GZIP for .gz files, or ZipDeflate for .zip files); see the sketch at the end of this reply.

    Link: ADF - compression support

    In the sink dataset, make sure you provide the right folder/sub-folder structure; this determines the final structure.

    Hope this helps.

    Ref - https://stackoverflow.com/a/51768355/10653466
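
    For illustration only, here is a minimal sketch of what the source dataset could look like for your .zip files (the dataset, linked service, and path names are placeholders, and the exact schema may vary by ADF version):

    {
        "name": "SourceZipBinary",
        "properties": {
            "type": "Binary",
            "linkedServiceName": {
                "referenceName": "AzureBlobStorageLS",
                "type": "LinkedServiceReference"
            },
            "typeProperties": {
                "location": {
                    "type": "AzureBlobStorageLocation",
                    "container": "Container",
                    "folderPath": "folder1/202001"
                },
                "compression": {
                    "type": "ZipDeflate"
                }
            }
        }
    }

    Define the sink dataset the same way but without the compression property; the copy activity then writes the decompressed files, and the sink location determines the final folder structure.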

    Monday, January 27, 2020 3:06 PM
    Owner
    Thanks for the reply.

    I followed the given reference and tried the same, but as I mentioned, it unzips and copies only after creating one more sub-folder and placing the .csv file inside it, which we don't want.

    Below is the output when I follow the reference:

    Container/folder2/201912/file2.zip/file2.csv

    But we want it as below:

    Container/folder2/201912/file2.csv


    Saturday, February 1, 2020 2:54 AM
  • Hi Nani,

    Can you please share what you have for the mapping and for the file path in your sink dataset settings?

    Monday, February 3, 2020 6:09 AM
    Owner
    Please find the attached Word document with my source and sink settings for the Copy activity. I have also included the Delete activity, whose source settings are the same as the Copy activity's.

    I am doing a binary copy with the recursive option; a rough sketch of the setup is below.
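
    Roughly, the Copy activity looks like this (an illustrative sketch with placeholder dataset names, not the exact JSON from the attachment):

    {
        "name": "UnzipAndCopy",
        "type": "Copy",
        "inputs": [ { "referenceName": "SourceZipBinary", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "SinkAdlsBinary", "type": "DatasetReference" } ],
        "typeProperties": {
            "source": {
                "type": "BinarySource",
                "storeSettings": {
                    "type": "AzureBlobStorageReadSettings",
                    "recursive": true,
                    "wildcardFileName": "*.zip"
                }
            },
            "sink": {
                "type": "BinarySink",
                "storeSettings": {
                    "type": "AzureBlobFSWriteSettings",
                    "copyBehavior": "PreserveHierarchy"
                }
            }
        }
    }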

    

    Wednesday, February 5, 2020 5:43 PM
    This is a bug in Azure Data Factory.

    Regards

    Rajaniesh

    Friday, February 14, 2020 9:46 PM
  • Hi Nani,

    I know what you are talking about with regard to the sub-folders. As far as I know, that is just how it works, and I think it is intentional: if a zip file contained several files and they were extracted into the same folder, there could be naming collisions among the extracted files landing in the same folder (or "pseudo" folder) in blob storage.


    Friday, February 14, 2020 10:02 PM