Optimal file storage for Data Lake

    Question

  • I've read that a best practice for storing data in Data Lake Gen1 is to store data in files no smaller than 256 MB. I am currently doing daily incremental loads into our Data Lake, into folders partitioned by date, with one file per source table. I call the root directory in ADL RAW. Example: RAW/SourceSystem/Year/Month/Day/SourceTableName.csv

    Sometimes there are only a couple of new records, which means the file is very small. Is it okay to keep the files stored in this manner even if they are smaller than that? I eventually merge each incremental file with a production file in a different directory, so that I have one large file per source table. For example, PRODUCTION/SourceSystem/SourceTableName.csv contains all data for that table.

    I'm curious whether this makes sense or whether there's a better file structure in ADL. Essentially, I have a RAW directory that contains many small incremental files partitioned by date, and a PRODUCTION directory that contains large files holding all the RAW data merged together.
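
    For illustration, the layout and the RAW-to-PRODUCTION merge described above could be sketched in Python roughly as follows. The function names raw_path, production_path, and merge_incrementals are hypothetical, zero-padded month and day folders are an assumption, the local file system stands in for ADL, and "merge" is assumed to mean a plain append with a single header row rather than an upsert.

```python
import csv
from datetime import date
from pathlib import Path, PurePosixPath


def raw_path(source_system, table, load_date, root="RAW"):
    """Build RAW/SourceSystem/Year/Month/Day/SourceTableName.csv.

    Assumption: month and day folders are zero-padded (e.g. 12/04).
    """
    return PurePosixPath(
        root,
        source_system,
        f"{load_date:%Y}",
        f"{load_date:%m}",
        f"{load_date:%d}",
        f"{table}.csv",
    )


def production_path(source_system, table, root="PRODUCTION"):
    """Build PRODUCTION/SourceSystem/SourceTableName.csv."""
    return PurePosixPath(root, source_system, f"{table}.csv")


def merge_incrementals(incremental_files, production_file):
    """Append the data rows of each incremental CSV to the production CSV.

    Assumption: every file has the same header row. The header is written
    only when the production file is new or empty. Returns the number of
    data rows appended.
    """
    production_file = Path(production_file)
    production_file.parent.mkdir(parents=True, exist_ok=True)
    write_header = (
        not production_file.exists() or production_file.stat().st_size == 0
    )
    rows_written = 0
    with production_file.open("a", newline="") as out:
        writer = csv.writer(out)
        for inc in incremental_files:
            with open(inc, newline="") as src:
                reader = csv.reader(src)
                header = next(reader, None)
                if header is None:
                    continue  # skip completely empty incremental files
                if write_header:
                    writer.writerow(header)
                    write_header = False
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

    With this sketch, raw_path("SourceSystem", "SourceTableName", date(2018, 12, 4)) yields RAW/SourceSystem/2018/12/04/SourceTableName.csv, and each daily file, however small, is appended to the single production file for its table.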


    Tuesday, December 4, 2018 4:44 PM

All replies

  • Hi FrankMn

    When data is stored in Data Lake Storage Gen1, the file size, number of files, and folder structure all have an impact on performance. Performance may also depend on the final data that you are going to use.

    You may refer to the documentation sections "Performance and scale considerations" and "Structure your data set", and to the detailed explanation in the Stack Overflow thread that addresses a similar query.

    Wednesday, December 5, 2018 11:29 AM
    Moderator
  • Thanks for posting this information. This is the same structure I have implemented. I have one last question, though: when doing an initial load of data (not incremental) covering, say, 3 years, would it make sense to partition it into separate files in RAW by row date, or to store the entire initial load in one file in RAW? My incremental loads will be partitioned daily by the timestamp of the record, so if I loaded 3 years of data into RAW initially, the records would span many different dates. Should I split those into separate smaller files, one file per date, or load them into one large file in RAW that has many dates in it?
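
    If the initial load were split by row date to match the daily incremental layout, the grouping step could look roughly like the Python sketch below. The names split_by_row_date and daily_raw_paths are hypothetical, the date column and its %Y-%m-%d format are assumptions, and zero-padded month and day folders are assumed. It only illustrates the one-file-per-date option and does not settle which option performs better.

```python
from collections import defaultdict
from datetime import datetime
from pathlib import PurePosixPath


def split_by_row_date(rows, date_column, date_format="%Y-%m-%d"):
    """Group rows (dicts, e.g. from csv.DictReader) by the date in date_column.

    Assumption: date_column holds strings that match date_format.
    Returns a dict mapping datetime.date -> list of rows for that date.
    """
    by_day = defaultdict(list)
    for row in rows:
        day = datetime.strptime(row[date_column], date_format).date()
        by_day[day].append(row)
    return dict(by_day)


def daily_raw_paths(by_day, source_system, table, root="RAW"):
    """Map each date to RAW/SourceSystem/Year/Month/Day/SourceTableName.csv.

    Assumption: month and day folders are zero-padded.
    """
    return {
        day: PurePosixPath(
            root,
            source_system,
            f"{day:%Y}",
            f"{day:%m}",
            f"{day:%d}",
            f"{table}.csv",
        )
        for day in sorted(by_day)
    }
```

    Each group would then be written to its own dated file, the same way a daily incremental load is. One file per date may again produce files well under the 256 MB figure mentioned in the question, while a single large file keeps the initial load in one place but mixes many dates.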
    Monday, December 10, 2018 4:30 PM