Why is index scan generated to deal with TOP(10) subquery results?

Answered Why is index scan generated to deal with TOP(10) subquery results?

  • Thursday, February 07, 2013 8:59 PM
     
     

    Hi everyone,

    I don’t understand the following situation in SQL Server 2012, I’m hoping s/o can please explain..?

     I have a large table (25M records) of experimental time series data. I have a clustered index on the key columns DATENUM & ACTOR.  A logical column IS_CHECKPOINT identifies a recurring but unpredictable checkpoint event. I have to do regular queries to find the Nth most recent checkpoint for a given actor. The following query is very fast (2 ms), which returns in this case the 10 most recent checkpoint events:

    SELECT TOP(10) DATENUM FROM ExperimentalEvents

       WHERE ACTOR= 15 AND IS_CHECKPOINT = 1

       ORDER BY DATENUM DESC

    In SSMS, the estimated execution plan shows 98% cost for the clustered index seek & 2% for the TOP operation. This seems normal & validates the clustered index choice. However, the query isn’t giving me exactly what I want, which is a single value : the earliest DATENUM value of the 10 values selected. I had thought that a subquery would be the answer:

    SELECT TOP(1) DATENUM FROM ExperimentalEvents

           WHERE DATENUM IN

           (SELECT TOP(10) DATENUM FROM ExperimentalEvents

                 WHERE ACTOR= 15 AND IS_CHECKPOINT = 1

                 ORDER BY DATENUM DESC)

           ORDER BY DATENUM ASC


    This query takes dramatically longer to execute, on the order of 1000X longer (1600 ms)! The estimated execution plan is not at all what I expected:

    1st Independent execution branch:

    5% cost for the clustered index seek

    0% cost for the TOP operation

    3% cost for a sort

    2nd Independent execution branch:

    84% cost for a non-clustered index scan on 100,000 DATENUM values

    8% cost for inner join of the two branches

    I don’t understand why the query engine decided to do the DATENUM index scan? Surely it sees that the outer select statement is operating on the same variable DATENUM that’s being returned by the subquery?

    I recoded with DECLARE TABLE to aggregate the subquery results. The execution time is now acceptable (30 ms), but the parse & compile time is a deal-breaker : over 4000 ms!

    DECLARE @Top_N_Table TABLE (DATENUM float);

    INSERT @Top_N_Table

    SELECT TOP(10) DATENUM FROM ExperimentalEvents

       WHERE ACTOR= 15 AND IS_CHECKPOINT = 1

       ORDER BY DATENUM DESC;

    SELECT TOP(1) DATENUM FROM @Top_N_Table

           ORDER BY DATENUM ASC

     I guess I'll have to return the (N x 1) DATENUM vector, as the first code block above does, & let the external caller (MATLAB) sort it out. It just seems like there should be a globally most-efficient solution using T-SQL..?

    Any assistance appreciated,

    Thanks,


    Brad





    • Edited by BradStiritz Thursday, February 07, 2013 9:01 PM removed nonprinting superscript
    • Edited by BradStiritz Thursday, February 07, 2013 9:28 PM added DECLARE TABLE code & results
    • Edited by BradStiritz Thursday, February 07, 2013 9:40 PM added plan of action : return (N x 1) vector
    • Edited by BradStiritz Thursday, February 07, 2013 9:49 PM bolded key phrase : Nth most recent
    •  

All Replies

  • Thursday, February 07, 2013 9:43 PM
     
      Has Code

    Hi BradStiritz

    Could you not just do this? I may be missing the point?

    SELECT TOP(1) DATENUM FROM ExperimentalEvents
    WHERE ACTOR= 15 AND IS_CHECKPOINT = 1
    ORDER BY DATENUM DESC



    Pérez

  • Thursday, February 07, 2013 9:47 PM
     
     

    Hi Perez,

    Thanks for responding. I'm afraid you may have missed what I was describing: I need the *Nth* most-recent checkpoint. I believe your code returns the *most-recent* checkpoint.


    Brad

  • Thursday, February 07, 2013 9:49 PM
    Moderator
     
     
    What is the reason for such complex query if you can use ROW_NUMBER() approach instead?

    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog

  • Thursday, February 07, 2013 9:53 PM
     
     

    Hi Naomi,

    Thanks for your reply & suggestion. I wasn't aware of ROW_NUMBER() before. I'll look at some examples & report back here after I rewrite my query accordingly..


    Brad

  • Thursday, February 07, 2013 9:56 PM
     
      Has Code

    Hi BradStiritz

    Then do you mean:

    DECLARE @Nth_Date int = 3
    --=========================================================================
    DECLARE @table TABLE (mydate DATE)
    INSERT INTO @table VALUES
    ('01-01-2013'),('01-01-2012'),('01-01-2011'),('01-01-2010'),('01-01-2009'),
    ('01-01-2008'),('01-01-2007'),('01-01-2006'),('01-01-2005'),('01-01-2004')
    
    ;WITH CTE_GiveMeThatDate AS (
    SELECT mydate, ROW_NUMBER() OVER (ORDER BY mydate DESC) AS [rank] FROM @table
    )
    SELECT mydate FROM CTE_GiveMeThatDate
    WHERE rank = @Nth_Date


    Pérez

  • Thursday, February 07, 2013 10:00 PM
     
     Answered Has Code
    However, the query isn’t giving me exactly what I want, which is a single value : the earliest DATENUM value of the 10 values selected. I had thought that a subquery would be the answer:

    SELECT TOP(1) DATENUM FROM ExperimentalEvents

           WHERE DATENUM IN

           (SELECT TOP(10) DATENUM FROM ExperimentalEvents

                 WHERE ACTOR= 15 AND IS_CHECKPOINT = 1

                 ORDER BY DATENUM DESC)

           ORDER BY DATENUM ASC


    This is a very inefficient addition to your original query, because the outer query requires another access to the table, and a very inefficient one as well. The solution is way way simpler, see below:

    SELECT TOP(1) DATENUM
    FROM (
      SELECT TOP(10) DATENUM
      FROM ExperimentalEvents
      WHERE ACTOR= 15
      AND   IS_CHECKPOINT = 1
      ORDER BY DATENUM DESC
    ) AS X
    ORDER BY DATENUM ASC
    


    Gert-Jan

    • Marked As Answer by BradStiritz Friday, February 08, 2013 3:19 AM
    •  
  • Friday, February 08, 2013 3:32 AM
     
     

    Hi Gert-Jan,

    Thanks for your solution code! I did originally see the simple SELECT FROM (SELECT ..) as a natural potential solution, but I just couldn't get it to work. 

    So your secret ingredient is giving the subquery a dummy name (you called it 'X'), which is never used. I wonder what the magic is? If you or s/o else might comment please on exactly why the dummy 'AS' clause makes things work, I'd really appreciate it.

    Thanks again for helping me out with this!


    Brad

  • Friday, February 08, 2013 3:38 AM
    Moderator
     
     

    Brad,

    But did you test this version against ROW_NUMBER() query version? Which performed better?


    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog

  • Friday, February 08, 2013 4:43 PM
     
     

    Hi Naomi,

    I'm sorry, I haven't investigated your ROW_NUMBER() suggestion yet. I didn't immediately see how to adapt my code to use ROW_NUMBER(), whereas Gert-Jan posted a complete code solution which I was able to test right away. His solution runs in the same 2 ms time as the subquery by itself (for small N anyway), so it has to be very close to an optimal solution. I don't expect to ever run this for large N.

    I am interested in understanding ROW_NUMBER(), so I will try to take a look in the next week & post back here..

    Thanks again for posting your suggestion


    Brad

  • Friday, February 08, 2013 4:52 PM
    Moderator
     
     

    I just posted these links if you want to understand this method better:

    Optimizing TOP N per Group Queries - blog by Itzik Ben-Gan explaining various optimization ideas
    Including an Aggregated Column's Related Values - Erik and mine blog presenting several solutions of the problem with explanations for each
    Including an Aggregated Column's Related Values - Part 2 - my blog post with use cases for the previous blog

    I also think that this solution was already posted in this thread by Perez.


    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog

  • Friday, February 08, 2013 5:17 PM
     
     
    Thanks for the links, Naomi, I'll take a look. Regarding Perez' solution, it appears to me not to produce the desired results, as I explained to Perez..

    Brad

  • Friday, February 08, 2013 5:59 PM
     
      Has Code

    The query between braces is a derived table. Aliasing it (" AS X") is mandatory when you select from it. So just as you might write:

    SELECT a.my_column
    FROM my_table AS a
    

    the derived table will replace the actual table, like this:

    SELECT a.my_column
    FROM ( SELECT ... ) AS a

    You are free to choose whatever alias name you want (in the example above, it is "a"). Also, the keyword AS is optional. The difference is specifying an alias is not optional.

    FYI, the ROW_NUMBER() approach that Naomi refers to would look something like this:

    ;WITH numbered_rows AS (
        SELECT DATENUM
        ,      ROW_NUMBER() OVER (ORDER BY DATENUM DESC) AS rn
        FROM ExperimentalEvents
        WHERE ACTOR = 15
        AND   IS_CHECKPOINT = 1
    )
    SELECT DATENUM
    FROM numbered_rows
    WHERE rn = 10  -- replace with 1 for the most recent
    
    I expect this approach to work very efficiently as well. It will probably tie (from a performance point of view) with the solution I posted earlier.

    Gert-Jan

  • Friday, February 08, 2013 6:46 PM
    Moderator
     
     
    I do not see your explanation. I meant his second query which is just slightly different than the first and using Nth for selection.

    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog

  • Friday, February 08, 2013 6:54 PM
     
     

    Thanks for your follow-up comments, Gert-Jan, they're very appreciated.

    Best wishes,


    Brad

  • Friday, February 08, 2013 6:56 PM
     
     
    The second solution Perez posted requires DECLARE TABLE. As I mentioned in my original post, this appears to increase the parse & compile time dramatically, which disqualifies it from consideration..

    Brad

  • Friday, February 08, 2013 6:59 PM
    Moderator
     
     

    Brad,

    Sorry, I have to laugh. The declare table part is not relevant to the solution, the solution begins from creating a CTE (;with statement).

    Declare table part is used to replicate your environment, just to demonstrate the solution. You need to change references to @t to your own table and you don't need this declare part at all for the actual solution.

    ---------

    It was a time I added a comment - Solution begins from here just to avoid such misunderstanding from people taking everything too literally.


    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog