none
How to improve Performance of U-SQL script

    Question

  • Hello,

    I am processing 100 GB data stored in Data lake in folder and sub folder structure using 1 U-SQL script, but it is taking a lot a lot of time to do the processing even if I am giving it Priority 1 and increasing the parallelism up to number 20. How can I improve the performance of the script so that I can get desired result in minutes?

    Pointers in that direction will also help me a lot.

     


    Thanks, Manthan Upadhyay

    Wednesday, March 9, 2016 6:13 AM

All replies

  • Hi Manthan

    Without seeing your script, I can only guess what you can do to improve performance.

    One suggestion is to look at where the time is being spent:

    • If you use file sets, then the compilation can take several minutes currently (or even time out during compilation). We are working on addressing this.
    • How much data does each file contain? In principle, the processing should be faster if you have larger files.
    • What is the data format in the files? If you have large files, data formats such as CSV or other formats that can be parallelized are normally more performing than non-parallelizable formats such as JSON or XML.

    Some performance tuning suggestions can be found in this presentation: http://www.slideshare.net/MichaelRys/usql-query-execution-and-performance-tuning

    Feel free to ping me with more details offline if you need more help.


    Michael Rys


    Thursday, March 10, 2016 12:04 AM
    Moderator
  • Hi Michael,

    Thank you for your inputs. below are response for each respectively.

    • Yes, I am using file sets, each file is .csv formatted files and resides in folder-sub folder structure.  Currently it is getting timeout in preparation step only as you said.
    • File sizes are very between 0 KB to 15 MBs.
    • All files are in .CSV format. We have used wild card extractor to read file and folder name in Extract query.

     Below is my script:

    @ReadData = 
    EXTRACT Time String,
                    Variable1 double?,
                    Variable2 double?,
                    Variable3 double?,
                    Variable4 double?,
                    Variable5 double?,
                    Variable6 double?,
                    Variable7 double?,
                    Variable8 double?,
                    Variable9 double?,
                    Variable10 double?,
                    Variable11 double?,
                    Variable12 double?,
                    Variable13 double?,
                    Variable14 double?,
                    Variable15 double?,
                    Variable16 double?,
                    Variable17 double?,
                    Variable18 double?,
                    Variable19 double?,
                    Variable20 double?,
                    Variable21 double?,
                    Variable22 double?,
                    Variable23 double?,
                    Variable24 double?,
                    Variable25 double?,
    foldername String,
                    filename String
           FROM "/<RootFolder>/{foldername:*}/<FileNamePrefix>_{filename:*}.csv"                 
           USING Extractors.Csv(silent:true);

    @Result= SELECT foldername.Split('_')[0] AS  X,
    foldername.Split('_')[1] AS  Y,
    "<FileNamePrefix>_"+ filename AS  <FileNamePrefix>,
    MAX(Variable1)                 AS  Max_Variable1,
    MIN(Variable1)                 AS  MIN_Variable1,
    AVG(Variable1)                 AS  AVG_Variable1, 
    STDEV(Variable1)             AS  STDEV_Variable1,
    MAX(Variable2)                 AS  Max_Variable2,
    MIN(Variable2)                 AS  MIN_Variable2,
    AVG(Variable2)                 AS  AVG_Variable2, 
    STDEV(Variable2)             AS  STDEV_Variable2,
    MAX(Variable3)                 AS  Max_Variable3,
    MIN(Variable3)                 AS  MIN_Variable3,
    AVG(Variable3)                 AS  AVG_Variable3, 
    STDEV(Variable3)             AS  STDEV_Variable3,
    MAX(Variable4)                 AS  Max_Variable4,
    MIN(Variable4)                 AS  MIN_Variable4,
    AVG(Variable4)                 AS  AVG_Variable4, 
    STDEV(Variable4)             AS  STDEV_Variable4,
    MAX(Variable5)                 AS  Max_Variable5,
    MIN(Variable5)                 AS  MIN_Variable5,
    AVG(Variable5)                 AS  AVG_Variable5, 
    STDEV(Variable5)             AS  STDEV_Variable5,
    MAX(Variable6)                 AS  Max_Variable6,
    MIN(Variable6)                 AS  MIN_Variable6,
    AVG(Variable6)                 AS  AVG_Variable6, 
    STDEV(Variable6)             AS  STDEV_Variable6,
    MAX(Variable7)                 AS  Max_Variable7,
    MIN(Variable7)                 AS  MIN_Variable7,
    AVG(Variable7)                 AS  AVG_Variable7, 
    STDEV(Variable7)             AS  STDEV_Variable7,
    MAX(Variable8)                 AS  Max_Variable8,
    MIN(Variable8)                 AS  MIN_Variable8,
    AVG(Variable8)                 AS  AVG_Variable8, 
    STDEV(Variable8)             AS  STDEV_Variable8,
    MAX(Variable9)                 AS  Max_Variable9,
    MIN(Variable9)                 AS  MIN_Variable9,
    AVG(Variable9)                 AS  AVG_Variable9, 
    STDEV(Variable9)             AS  STDEV_Variable9,
    MAX(Variable10)             AS  Max_Variable10,
    MIN(Variable10)               AS  MIN_Variable10,
    AVG(Variable10)               AS  AVG_Variable10, 
    STDEV(Variable10)   AS  STDEV_Variable10,
    MAX(Variable11)             AS  Max_Variable11,
    MIN(Variable11)               AS  MIN_Variable11,
    AVG(Variable11)               AS  AVG_Variable11, 
    STDEV(Variable11)   AS  STDEV_Variable11,
    MAX(Variable12)             AS  Max_Variable12,
    MIN(Variable12)               AS  MIN_Variable12,
    AVG(Variable12)               AS  AVG_Variable12, 
    STDEV(Variable12)   AS  STDEV_Variable12,
    MAX(Variable13)             AS  Max_Variable13,
    MIN(Variable13)               AS  MIN_Variable13,
    AVG(Variable13)               AS  AVG_Variable13, 
    STDEV(Variable13)   AS  STDEV_Variable13,
    MAX(Variable14)             AS  Max_Variable14,
    MIN(Variable14)               AS  MIN_Variable14,
    AVG(Variable14)               AS  AVG_Variable14, 
    STDEV(Variable14)   AS  STDEV_Variable14,
    MAX(Variable15)             AS  Max_Variable15,
    MIN(Variable15)               AS  MIN_Variable15,
    AVG(Variable15)               AS  AVG_Variable15, 
    STDEV(Variable15)   AS  STDEV_Variable15,
    MAX(Variable16)             AS  Max_Variable16,
    MIN(Variable16)               AS  MIN_Variable16,
    AVG(Variable16)               AS  AVG_Variable16, 
    STDEV(Variable16)   AS  STDEV_Variable16,
    MAX(Variable17)             AS  Max_Variable17,
    MIN(Variable17)               AS  MIN_Variable17,
    AVG(Variable17)               AS  AVG_Variable17, 
    STDEV(Variable17)   AS  STDEV_Variable17,
    MAX(Variable18)             AS  Max_Variable18,
    MIN(Variable18)               AS  MIN_Variable18,
    AVG(Variable18)               AS  AVG_Variable18, 
    STDEV(Variable18)   AS  STDEV_Variable18,
    MAX(Variable19)             AS  Max_Variable19,
    MIN(Variable19)               AS  MIN_Variable19,
    AVG(Variable19)               AS  AVG_Variable19, 
    STDEV(Variable19)   AS  STDEV_Variable19,
    MAX(Variable20)             AS  Max_Variable20,
    MIN(Variable20)               AS  MIN_Variable20,
    AVG(Variable20)               AS  AVG_Variable20, 
    STDEV(Variable20)   AS  STDEV_Variable20,
    MAX(Variable21)             AS  Max_Variable21,
    MIN(Variable21)               AS  MIN_Variable21,
    AVG(Variable21)               AS  AVG_Variable21, 
    STDEV(Variable21)   AS  STDEV_Variable21,
    MAX(Variable22)             AS  Max_Variable22,
    MIN(Variable22)               AS  MIN_Variable22,
    AVG(Variable22)               AS  AVG_Variable22, 
    STDEV(Variable22)   AS  STDEV_Variable22,
    MAX(Variable23)             AS  Max_Variable23,
    MIN(Variable23)               AS  MIN_Variable23,
    AVG(Variable23)               AS  AVG_Variable23, 
    STDEV(Variable23)   AS  STDEV_Variable23,
    MAX(Variable24)             AS  Max_Variable24,
    MIN(Variable24)               AS  MIN_Variable24,
    AVG(Variable24)               AS  AVG_Variable24, 
    STDEV(Variable24)   AS  STDEV_Variable24,
    MAX(Variable25)             AS  Max_Variable25,
    MIN(Variable25)               AS  MIN_Variable25,
    AVG(Variable25)               AS  AVG_Variable25, 
    STDEV(Variable25)   AS  STDEV_Variable25
    FROM @ReadData
    GROUP BY foldername,filename;

    OUTPUT @Result
    TO "/<OUTPUT_ROOT_FOLDER>/output_FileName.Csv"                
    USING Outputters.Csv();

    I will go through the link you have provided and try to implement those suggestions.

    Let us know how we can connect you offline, so that we can discuss in detail.


    Thanks, Manthan Upadhyay

    Thursday, March 10, 2016 5:23 AM
  • Having too many files (especially small files) will currently cause a large compilation time thus making your end to end job execution time too slow. The only way to address that is to limit the number of files to a smaller number.

    You can do that by either merging the files (or use the append feature to append data to the files) or write scripts that select less files (e.g., one script per folder or by parameterizing a filter on the file set virtual columns).A

    As I mentioned, we are working on addressing the compile time issue, but the above advise will still help you with better performance most likely.

    As to my direct contact information, the deck I pointed you to contains my contact information.


    Michael Rys

    • Proposed as answer by Michael Amadi Sunday, March 13, 2016 8:19 AM
    Friday, March 11, 2016 1:17 AM
    Moderator
  • What is the append feature? Does this allow U-SQL to append data to an existing CSV file? If so, how do you use it?
    Friday, July 15, 2016 2:46 PM
  • What is the append feature? Does this allow U-SQL to append data to an existing CSV file? If so, how do you use it?

    You cannot append to files using U-SQL. But the file system has append capabilities through their file APIs.


    Michael Rys

    Friday, July 15, 2016 6:34 PM
    Moderator
  • Will this functionality be added at some point?
    Friday, July 15, 2016 6:43 PM
  • If you have feature requests, please file them through http://aka.ms/adlfeedback. That allows us to see collect and collate the community feedback and feed it into our planning process.

    Michael Rys

    Friday, July 15, 2016 6:59 PM
    Moderator