none
USQL Horrendous performance

    Question

  • I 'think' that this problem is relating to the query optimization that Azure Data Lake Analytics does; but let's see...

    My data set is a single managed table of +/- 400M records.
    I have 2 separate queries (TVFs) doing aggregations, and then a final Query to join the 2 together for final results.

    So ...
    Table >  Header Query
    Table >  Detail Query
    Result = Header Query + Detail Query

    To test the whole logic out, I run the minor queries separately with a filter, storing the results to file, and then use the hard files as sources for the final query; these are the total duration (minutes).
    Header Query  1.4  (408 rows)
    Detail Query  0.9  (3298 rows)
    Final Query   0.9  (408 rows)

    So I know that as a maximum, I can get my result in around 3.5 minutes. However, I don't really want to create new intermediary files. I want to use the TDFs directly to feed the final query.

    With TDFs called directly in the final query, the Job Graph gets to around 97% progress within about 1.5 minutes. But then, all hell breaks loose ! The last node is a Aggregate with 2,500 Vertices that says 16 minutes compute time. So my question ... WHY ??
    In the end I had to cancel at 20+ minutes.

    Yes, I have some inequal joins in there, but, as I state above, when the sub queries are run into flat files and used as sources, the query runs swiftly.

    I'm guessing that the engine is trying to 'optimise', and in doing so, is ignoring the separation of designed elements ?
    Of course I could force the staged execution of the initial 2 queries in a procedure, and write the results to file before the final query. But what then happens when I want to run against the full 400M ? 

    In the meantime, at 20+ minutes a go ...the bills are starting to stack up !

    Is this a case of me not understanding some fundamental concepts of how Azure works ? 
    So, can anyone explain what's going on? Any help 


    Final Query:

    @Header = 
    SELECT [CTNNumber],       [CTNCycleNo],       [SeqStart],       [SeqEnd],       [StartUTC],       [EndUTC],       [StartLoc],       [StartType],       [EndLoc],       [EndType],       [Start Step],       [Start Ctn Status],       [Start Fill Status],       [EndStep],       [End Ctn Status],       [End Fill Status]
    FROM [Play].[getCycles3]("12345") AS X;
    
    @Detail =
    SELECT [CTNNumber],       [SeqNo] AS [SeqNo],       [LocationType],       [LocationID],       [BizstepDescription],       [ContainerStatus],       [FillStatus],       [UTCTimeStampforEvent]
    FROM [Play].[getRaw] ("12345") AS Z;
    
    @result =    
    SELECT        H.[CTNNumber], H.[CTNCycleNo], H.[SeqStart], H.[SeqEnd]        ,COUNT([D].[SeqNo]) AS [SeqCount]        //, COUNT(DISTINCT [LocationID]) AS [#Locations]    
    FROM         @Header AS [H]        
                    INNER JOIN @Detail AS [D]        
    ON         [H].[CTNNumber] == [D].[CTNNumber]     
    WHERE         
              [D].[SeqNo] >= [H].[SeqStart] AND        
              [D].[SeqNo] <= [H].[SeqEnd]      
    GROUP BY         
               H.[CTNNumber], H.[CTNCycleNo], H.[SeqStart], H.[SeqEnd]    ;

    Monday, October 2, 2017 8:52 AM

All replies

  • I'd be happy to help.  This looks like a duplicate of a Stackoverflow post.  How about we focus the response there. https://stackoverflow.com/questions/46407002/usql-nesting-tvfs-and-queries-is-giving-horrendous-results
    Wednesday, October 4, 2017 10:16 PM
  • Hi Gary,

    Thanks for your offer of help.
    I don't like duplicate posting on different sites, just got a bit desperate for responses on this one.
    (even if it's to be told that I'm asking a dumb question is sometimes enough!)
    Will gladly close this one down if things get rolling on SO.

    Where do you want to start ?
    Is there more info I can give that would help ?

    Many Thanks

    Simon

    Thursday, October 5, 2017 8:49 AM