Data Lake Analytics U-SQL EXTRACT speed (Local vs Azure)

    Question

  • Hi Folks,

    I've been looking into using Azure Data Lake Analytics to manipulate some gzipped XML data that I have stored in Azure Blob Storage, but I'm running into an interesting issue. When I use U-SQL locally to process 500 of these XML files, processing is extremely quick: roughly 40 seconds using 1 AU (which appears to be the local limit). However, when we run the same script in Azure using 5 AUs, processing takes 17+ minutes.

    We eventually want to scale this up to ~20,000 files or more, but have reduced the set in order to measure the speed.

    Each file contains a collection of 50 XML objects (with a varying amount of detail in their child elements). The files are roughly 1 MB each when gzipped and between 5 MB and 10 MB when uncompressed. 99% of the processing time is spent in the EXTRACT section of the U-SQL script.

    Things tried:

    1. Unzipped the files before processing. This took roughly the same time as the zipped version, and certainly nowhere near the 40 seconds I was seeing locally.
    2. Moved the data from Blob Storage to Azure Data Lake Store. This took exactly the same length of time.
    3. Temporarily removed about half of the data from the files and re-ran. Surprisingly, this didn't take more than a minute off either.
    4. Added more AUs to reduce the processing time. This worked extremely well, but it isn't a long-term solution because of the costs that would be incurred.

    It seems to me that there is a major bottleneck when getting the data from Azure Blob Storage or Azure Data Lake Store. Am I missing something obvious?

    P.S. Let me know if you need any more information.

    Thanks,

    Nick.

    Wednesday, May 23, 2018 4:12 PM

All replies

  • Hi Nick,

    See slide 31 of "Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konferenz 2018)". There is a preview option,

    SET @@FeaturePreviews="InputFileGrouping:on";

    which groups small files together so that they are processed by a limited number of vertices, instead of each small file getting its own vertex.
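
    As a rough sketch of where that statement goes, it can be placed at the top of the script, before the EXTRACT over the file set. The paths, column names, the [MyFormats] assembly and the MyFormats.XmlItemExtractor class below are all hypothetical placeholders (they are not from your script); substitute whatever XML extractor you are already using.

    ```usql
    // Preview switch from the slide deck; set it before the EXTRACT it should affect.
    SET @@FeaturePreviews = "InputFileGrouping:on";

    // Hypothetical assembly that contains the custom XML extractor.
    REFERENCE ASSEMBLY [MyFormats];

    // Hypothetical file-set pattern: {FileName} is a virtual column that is
    // filled in from the name of each file the pattern matches.
    DECLARE @input string = "/input/xml/{FileName}.xml.gz";
    DECLARE @output string = "/output/items.csv";

    // One EXTRACT over the whole file set, so the grouping has many small files to combine.
    @items =
        EXTRACT Id string,
                Payload string,
                FileName string
        FROM @input
        USING new MyFormats.XmlItemExtractor();   // hypothetical user-defined extractor

    OUTPUT @items
    TO @output
    USING Outputters.Csv(outputHeader : true);
    ```

    After enabling it, comparing the vertex count of the extract stage in the job graph with and without the option should show whether the grouping is actually being applied to your files.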

    Thursday, May 24, 2018 2:33 PM