none
Change in file set pattern results in performance warning

    Question

  • I changed my file set pattern in accordance with the upcoming deprecation.

    Now I'm seeing this warning:

    "All final queries over stream sets should have a predicate on all virtual columns to enable partition elimination for better query performance."

    The error points to an EXTRACT statement. Here is my simplified code:

    DECLARE @in = @"SomeFolder/{fName}-20{fYear}-{fMonth}-{fDay}.txt";
    
    @TABLE =
        EXTRACT someValue string,
                fStudy string,
                fYear string,
                fMonth string,
                fDay string,
                error SqlMap<string,string>
        FROM @in
        USING new CustomCsvExtractor();
    
    @TABLE =
        SELECT someValue,
               string.Format("{0}-20{1}-{2}-{3}.txt", fName, fYear, fMonth, fDay) AS filename,
               error
        FROM @TABLE;
    

    NOTE: I know this code might seem odd, but we only want certain file patterns, but also want the filename, so this is the workaround we determined.

    Any idea is causing the warning?

    Wednesday, August 31, 2016 7:38 PM

Answers

  • Let's assume your fileset pattern is selecting 100k files (e.g., hourly logs over a long time range). But you really only are interested in looking for the last 7 days (which is a more manageable 168 files).

    If you write the query in either of the following way:

    @d = EXTRACT c1 int, c2 string, date dateTime 
         FROM "/path/{date:yyyy}/{date:MM}/{date:dd}/{date:HH}.csv" 
         USING Extractors.Csv();
    
    @d = SELECT * FROM @d WHERE c1 > 100;
    
    ...

    or

    @d = EXTRACT c1 int, c2 string, date dateTime 
         FROM "/path/{date:yyyy}/{date:MM}/{date:dd}/{date:HH}.csv" 
         USING Extractors.Csv();
    
    @d = SELECT * FROM @d WHERE some_function_that_we_cannot_use_to_eliminate_files(date);
    
    ...

    You will end up with trying to read all 100k files. In the first case it is expected, but maybe you still didn't want to look at all files. This is your case, except that you are fine with reading all the files.

    In the second case, someone may expect the predicate to eliminate the files from the plan. But since we do not know what the function does, we will have to still read all files and thus raise the warning to make sure the query author is aware of it.

    So how do I get the plan down to only read a smaller number of files? By writing a predicate that the optimizer understands how to use to eliminate the files, such as:

    @d = SELECT * FROM @d WHERE date BETWEEN @start_date AND @end_date;

     So see the warning as a way to let the user know that the used expression or missing expression will not help eliminating the file.

    Also, a predicate on a non-virtual column obviously still requires us to read all the files to check for the condition.


    Michael Rys

    • Proposed as answer by Michael Amadi Thursday, September 1, 2016 7:24 PM
    • Marked as answer by Misinformed DNA Thursday, September 1, 2016 9:50 PM
    Thursday, September 1, 2016 7:09 PM
    Moderator

All replies

  • The warning is raised because you are looking for all files and do not have a predicate specified. If that is what you want, then you can safely ignore the warning.

    We raise the warning because we want to make sure that you are aware of the fact that if you do not specify a predicate, or specify a predicate which is not used for partition elimination, that you will get a job that could be too big/costly.

    I hope this helps (and thanks for pointing out that it may be a bit confusing, we will need to work on the wording :)).


    Michael Rys

    • Proposed as answer by Michael Amadi Thursday, September 1, 2016 2:59 PM
    Wednesday, August 31, 2016 11:31 PM
    Moderator
  • Is there a predicate in file sets itself or are you referring to a simple WHERE predicate in a SELECT clause?
    Thursday, September 1, 2016 2:08 PM
  • I believe Michael was referring to the WHERE predicate in the SELECT clause since that's where you can influence the partition elimination/pruning.

    Regards,

    Michael Amadi

    Please use the 'Mark as answer' link to mark a post that answers your question. If you find a reply helpful, please remember to vote it as helpful :)

    Website: http://www.nimblelearn.com, Twitter: @nimblelearn

    Thursday, September 1, 2016 2:59 PM
  • So why is it "more important" too have a WHERE predicate on a virtual column vs a standard column?

    I'm not seeing the point of the warning message. Clearly, any smaller table size would result in faster performance.

    Thursday, September 1, 2016 6:48 PM
  • Let's assume your fileset pattern is selecting 100k files (e.g., hourly logs over a long time range). But you really only are interested in looking for the last 7 days (which is a more manageable 168 files).

    If you write the query in either of the following way:

    @d = EXTRACT c1 int, c2 string, date dateTime 
         FROM "/path/{date:yyyy}/{date:MM}/{date:dd}/{date:HH}.csv" 
         USING Extractors.Csv();
    
    @d = SELECT * FROM @d WHERE c1 > 100;
    
    ...

    or

    @d = EXTRACT c1 int, c2 string, date dateTime 
         FROM "/path/{date:yyyy}/{date:MM}/{date:dd}/{date:HH}.csv" 
         USING Extractors.Csv();
    
    @d = SELECT * FROM @d WHERE some_function_that_we_cannot_use_to_eliminate_files(date);
    
    ...

    You will end up with trying to read all 100k files. In the first case it is expected, but maybe you still didn't want to look at all files. This is your case, except that you are fine with reading all the files.

    In the second case, someone may expect the predicate to eliminate the files from the plan. But since we do not know what the function does, we will have to still read all files and thus raise the warning to make sure the query author is aware of it.

    So how do I get the plan down to only read a smaller number of files? By writing a predicate that the optimizer understands how to use to eliminate the files, such as:

    @d = SELECT * FROM @d WHERE date BETWEEN @start_date AND @end_date;

     So see the warning as a way to let the user know that the used expression or missing expression will not help eliminating the file.

    Also, a predicate on a non-virtual column obviously still requires us to read all the files to check for the condition.


    Michael Rys

    • Proposed as answer by Michael Amadi Thursday, September 1, 2016 7:24 PM
    • Marked as answer by Misinformed DNA Thursday, September 1, 2016 9:50 PM
    Thursday, September 1, 2016 7:09 PM
    Moderator
  • Thanks for the info. However, I take fault with the warning message.

    All final queries over stream sets should have a predicate on all virtual columns to enable partition elimination for better query performance.

    Telling me that I "should have a predicate on all virtual columns" makes me think that I am doing something wrong.

    Thursday, September 1, 2016 9:50 PM