How to get better performance on large files in ADLS and U-SQL


  • I have a Data Lake with RAW, Staging, and Curated zones. When I am "cooking" my data, I merge it with the curated files. Some of the files in Curated are extremely large: over 100 GB of data and a billion records, and growing.

    To get the data from RAW into Curated, I check for records that already exist but need to be "updated". Essentially I am reprocessing the entire Curated data set every day, which is very costly. What is the recommended method to get all the RAW data into a Curated folder in a daily process? Is there a better way to process it so that I don't have to reprocess all the data every day?
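    To make the daily merge concrete, here is a minimal U-SQL sketch of the upsert pattern described above. All paths, column names, and the `UpdatedAt` column are hypothetical, purely to illustrate the shape of the job:

    ```usql
    // Hypothetical schema: Id is the record key, UpdatedAt says which copy is newer.
    @raw =
        EXTRACT Id int, Value string, UpdatedAt DateTime
        FROM "/raw/2019-02-18/input.csv"
        USING Extractors.Csv(skipFirstNRows: 1);

    @curated =
        EXTRACT Id int, Value string, UpdatedAt DateTime
        FROM "/curated/master.csv"
        USING Extractors.Csv(skipFirstNRows: 1);

    // Upsert: union both sets, then keep only the newest row per Id.
    @ranked =
        SELECT Id, Value, UpdatedAt,
               ROW_NUMBER() OVER (PARTITION BY Id ORDER BY UpdatedAt DESC) AS rn
        FROM (SELECT * FROM @raw UNION ALL SELECT * FROM @curated) AS u;

    @merged =
        SELECT Id, Value, UpdatedAt FROM @ranked WHERE rn == 1;

    OUTPUT @merged TO "/curated/master.csv" USING Outputters.Csv(outputHeader: true);
    ```

    Note that this still reads and rewrites the whole Curated file every day, which is exactly the cost in question.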

    Monday, February 18, 2019 10:39 PM

All replies