Partitioning Large files in U-SQL/Data Lake

    Question

  • I need help understanding the proper partitioning strategy for very large files in U-SQL/Data Lake.  I have a 50 GB file (people information) that I need to dedupe using U-SQL, and to do this I am essentially cross joining the file against itself in the script.  Without partitioning the data, this is a very long process, and I am not sure what the best approach is.

    Should I first use Data Factory (this is how I'm loading the ADL) to partition the RAW file into multiple files by last name? So instead of one "people.csv" file I would have people_a.csv, people_b.csv, etc.  Then read the files as one in U-SQL via SELECT [COLUMNS] FROM "RAW/people_{lastnameinit}.csv"?

    Or should I maintain the one production "people.csv" file but partition the data by creating a table with an index and inserting the data into this table and then using the new table to do my deduping logic?  What is the best approach?

    Monday, May 7, 2018 2:53 PM
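(For reference, the file-split option described above maps onto U-SQL's file-set pattern in EXTRACT, where the `{lastnameinit}` part of the path becomes a virtual column. A minimal sketch, assuming hypothetical column names and types for people.csv:)

```sql
// Hypothetical schema -- replace the columns with the real ones in people.csv.
// The {LastNameInit} token in the path is bound as a virtual string column,
// so all people_a.csv, people_b.csv, ... files are read as one rowset.
@people =
    EXTRACT FirstName string,
            LastName string,
            Email string,
            LastNameInit string // virtual column from the file-set pattern
    FROM "/RAW/people_{LastNameInit}.csv"
    USING Extractors.Csv(skipFirstNRows : 1);
```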

Answers

  • How are you deduping? If you can share your script, that would make it easier to help. To answer your question: I don't think you have to partition into files if you are doing this via U-SQL, because deduping requires reading all of the data anyway, and it is easier to read it from a single file than from multiple files.
    Thursday, May 10, 2018 9:44 PM
    Moderator
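(One common way to dedupe over the single file in U-SQL is a windowed ROW_NUMBER partitioned by the identifying columns, keeping only the first row per group. A sketch under assumed column names and dedup key -- the original post does not specify the schema:)

```sql
// Hypothetical columns; replace (FirstName, LastName, Email) with the real key.
@people =
    EXTRACT FirstName string,
            LastName string,
            Email string
    FROM "/RAW/people.csv"
    USING Extractors.Csv(skipFirstNRows : 1);

// Number the rows within each group of identical key values...
@ranked =
    SELECT FirstName,
           LastName,
           Email,
           ROW_NUMBER() OVER (PARTITION BY FirstName, LastName, Email
                              ORDER BY Email) AS rn
    FROM @people;

// ...and keep only the first row of each group.
@deduped =
    SELECT FirstName, LastName, Email
    FROM @ranked
    WHERE rn == 1;

OUTPUT @deduped
TO "/clean/people_deduped.csv"
USING Outputters.Csv(outputHeader : true);
```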

All replies

  • Thanks for the reply.  This is the route I ended up going (a single large file).  The performance hit I was experiencing was caused by a bad join due to how U-SQL handles null values.  I also split the cross join into separate SELECT statements UNIONed together, and this fixed the bottleneck.  In the end, with the corrected SELECT statement, I didn't need to do any partitioning, as U-SQL handled the data efficiently (a million records in under 2 minutes).
    Wednesday, May 16, 2018 2:22 PM
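    (A note on the null issue mentioned above: U-SQL follows C# comparison semantics, so `null == null` evaluates to true in a join predicate, unlike ANSI SQL. On a self join, rows with a null key therefore all match each other, which can blow up the join. The separate-SELECTs-plus-UNION fix could be sketched as follows, with assumed columns since the actual script was not shared:)

```sql
// Assumed columns (Id, Email, FirstName, LastName). Match on Email where it
// is present, otherwise on name -- two separate SELECTs UNIONed, so null
// Email keys never join against each other.
@byEmail =
    SELECT a.Id AS LeftId, b.Id AS RightId
    FROM @people AS a
         INNER JOIN @people AS b ON a.Email == b.Email
    WHERE a.Email != null   // C# semantics: filters out the null keys
      AND a.Id < b.Id;      // avoid self-pairs and mirrored duplicates

@byName =
    SELECT a.Id AS LeftId, b.Id AS RightId
    FROM @people AS a
         INNER JOIN @people AS b
             ON a.FirstName == b.FirstName AND a.LastName == b.LastName
    WHERE a.Email == null
      AND a.Id < b.Id;

@matches =
    SELECT * FROM @byEmail
    UNION ALL
    SELECT * FROM @byName;
```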
  • I am glad that it worked
    Wednesday, May 16, 2018 6:17 PM
    Moderator