How to split the input blob storage files in an ADF process

  • Question

  • Hi,

    My input blob storage file is several GB in size. I want to split it into smaller files and store them in another blob storage account (the output blob storage). From the output blob storage the data will then be moved into Azure SQL Data Warehouse.

    I want to do the split on HDInsight using Spark/Python scripting.

    Could you please share code for this? (A rough sketch of the kind of job in question follows below.)

    Thursday, October 12, 2017 5:56 PM
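
    For illustration only, here is a minimal PySpark sketch of the kind of split job being asked about. The storage account, container, and path names are placeholders, not details from this thread:

    # Minimal PySpark sketch: split one large line-delimited blob into many
    # smaller part files in another container. Paths are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("SplitLargeBlob").getOrCreate()

    input_path = "wasbs://inputcontainer@mystorageaccount.blob.core.windows.net/bigfile.csv"
    output_path = "wasbs://outputcontainer@mystorageaccount.blob.core.windows.net/split/"

    # Reading as plain text splits on line boundaries, so records are never cut.
    df = spark.read.text(input_path)

    # Each partition becomes one output file; pick the count so that each
    # file lands near the target size (roughly total size / target file size).
    num_output_files = 100
    df.repartition(num_output_files).write.mode("overwrite").text(output_path)

    spark.stop()

    Because Spark's text reader works on whole lines, no record is split across the output files, and the part files come out at roughly equal size.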

All replies

  • Why do you want to split the files?

    Azure SQL DW works best with big files (ideally around 500 MB). Files larger than 500 MB are split automatically and processed by multiple workers to achieve parallel loading.

    The only case where this does not work is when your files are compressed; if that is not the case, I see no point in splitting the files.

    -gerhard


    Gerhard Brueckl
    blogging @ http://blog.gbrueckl.at
    working @ http://www.pmOne.com

    Friday, October 13, 2017 10:44 AM
  • My files are around 1 TB; I want to split them into 200 MB files and then push them to HDInsight.

    Please suggest the best practice for doing the split in ADF and then pushing the output to HDInsight. (See the calculation sketched below.)

    Friday, October 13, 2017 10:53 AM
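
    For the numbers above, the partition count for the sketch earlier in the thread can be worked out directly. A quick back-of-the-envelope calculation in Python (assuming 1 TB = 1024 GB):

    import math

    total_size_mb = 1 * 1024 * 1024   # 1 TB expressed in MB
    target_file_mb = 200              # desired output file size

    num_output_files = math.ceil(total_size_mb / target_file_mb)
    print(num_output_files)           # 5243 -> pass this to repartition()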
  • We have done a lot of such splitting jobs; they range from using Linux's split command to writing MapReduce jobs.

    So you need a program that will break a 100 GB file down into ten files of 10 GB each?
    Note that a naive byte-level split may truncate records and cause data corruption when the split files are combined back together (a record-boundary-safe alternative is sketched below).

    Could you be a bit more specific about your requirement?

    ------------------

    ThirdEye Data
    https://thirdeyedata.io/


    Friday, October 13, 2017 6:44 PM
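
    On the truncation point above: splitting on record (line) boundaries rather than at fixed byte offsets avoids cutting a row in half, so the pieces can be recombined without corruption. A small stand-alone Python sketch of that approach; the file names are placeholders:

    # Split a large line-delimited file into ~200 MB chunks without ever
    # cutting a line in half. File names are placeholders.
    CHUNK_BYTES = 200 * 1024 * 1024

    def split_by_lines(src_path, dst_prefix, chunk_bytes=CHUNK_BYTES):
        part, written, out = 0, 0, None
        with open(src_path, "rb") as src:
            for line in src:                      # iterating yields whole lines only
                if out is None or written >= chunk_bytes:
                    if out:
                        out.close()
                    out = open(f"{dst_prefix}{part:05d}.csv", "wb")
                    part, written = part + 1, 0
                out.write(line)
                written += len(line)
        if out:
            out.close()

    split_by_lines("bigfile.csv", "bigfile-part-")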