Why is index scan generated to deal with TOP(10) subquery results?
-
Thursday, February 07, 2013 8:59 PM
Hi everyone,
I don’t understand the following situation in SQL Server 2012, I’m hoping s/o can please explain..?
I have a large table (25M records) of experimental time series data. I have a clustered index on the key columns DATENUM & ACTOR. A logical column IS_CHECKPOINT identifies a recurring but unpredictable checkpoint event. I have to do regular queries to find the Nth most recent checkpoint for a given actor. The following query is very fast (2 ms), which returns in this case the 10 most recent checkpoint events:
SELECT TOP(10) DATENUM FROM ExperimentalEvents
WHERE ACTOR= 15 AND IS_CHECKPOINT = 1
ORDER BY DATENUM DESC
In SSMS, the estimated execution plan shows 98% cost for the clustered index seek & 2% for the TOP operation. This seems normal & validates the clustered index choice. However, the query isn’t giving me exactly what I want, which is a single value : the earliest DATENUM value of the 10 values selected. I had thought that a subquery would be the answer:
SELECT TOP(1) DATENUM FROM ExperimentalEvents
WHERE DATENUM IN
(SELECT TOP(10) DATENUM FROM ExperimentalEvents
WHERE ACTOR= 15 AND IS_CHECKPOINT = 1
ORDER BY DATENUM DESC)
ORDER BY DATENUM ASC
This query takes dramatically longer to execute, on the order of 1000X longer (1600 ms)! The estimated execution plan is not at all what I expected:
1st Independent execution branch:
5% cost for the clustered index seek
0% cost for the TOP operation
3% cost for a sort
2nd Independent execution branch:
84% cost for a non-clustered index scan on 100,000 DATENUM values
8% cost for inner join of the two branches
I don’t understand why the query engine decided to do the DATENUM index scan? Surely it sees that the outer select statement is operating on the same variable DATENUM that’s being returned by the subquery?
I recoded with DECLARE TABLE to aggregate the subquery results. The execution time is now acceptable (30 ms), but the parse & compile time is a deal-breaker : over 4000 ms!
DECLARE @Top_N_Table TABLE (DATENUM float);
INSERT @Top_N_Table
SELECT TOP(10) DATENUM FROM ExperimentalEvents
WHERE ACTOR= 15 AND IS_CHECKPOINT = 1
ORDER BY DATENUM DESC;
SELECT TOP(1) DATENUM FROM @Top_N_Table
ORDER BY DATENUM ASC
I guess I'll have to return the (N x 1) DATENUM vector, as the first code block above does, & let the external caller (MATLAB) sort it out. It just seems like there should be a globally most-efficient solution using T-SQL..?
Any assistance appreciated,
Thanks,
Brad
- Edited by BradStiritz Thursday, February 07, 2013 9:01 PM removed nonprinting superscript
- Edited by BradStiritz Thursday, February 07, 2013 9:28 PM added DECLARE TABLE code & results
- Edited by BradStiritz Thursday, February 07, 2013 9:40 PM added plan of action : return (N x 1) vector
- Edited by BradStiritz Thursday, February 07, 2013 9:49 PM bolded key phrase : Nth most recent
All Replies
-
Thursday, February 07, 2013 9:43 PM
Hi BradStiritz
Could you not just do this? I may be missing the point?
SELECT TOP(1) DATENUM FROM ExperimentalEvents WHERE ACTOR= 15 AND IS_CHECKPOINT = 1 ORDER BY DATENUM DESC
Pérez
-
Thursday, February 07, 2013 9:47 PM
Hi Perez,
Thanks for responding. I'm afraid you may have missed what I was describing: I need the *Nth* most-recent checkpoint. I believe your code returns the *most-recent* checkpoint.
Brad
-
Thursday, February 07, 2013 9:49 PMModeratorWhat is the reason for such complex query if you can use ROW_NUMBER() approach instead?
For every expert, there is an equal and opposite expert. - Becker's Law
My blog -
Thursday, February 07, 2013 9:53 PM
Hi Naomi,
Thanks for your reply & suggestion. I wasn't aware of ROW_NUMBER() before. I'll look at some examples & report back here after I rewrite my query accordingly..
Brad
-
Thursday, February 07, 2013 9:56 PM
Hi BradStiritz
Then do you mean:
DECLARE @Nth_Date int = 3 --========================================================================= DECLARE @table TABLE (mydate DATE) INSERT INTO @table VALUES ('01-01-2013'),('01-01-2012'),('01-01-2011'),('01-01-2010'),('01-01-2009'), ('01-01-2008'),('01-01-2007'),('01-01-2006'),('01-01-2005'),('01-01-2004') ;WITH CTE_GiveMeThatDate AS ( SELECT mydate, ROW_NUMBER() OVER (ORDER BY mydate DESC) AS [rank] FROM @table ) SELECT mydate FROM CTE_GiveMeThatDate WHERE rank = @Nth_DatePérez
-
Thursday, February 07, 2013 10:00 PM
However, the query isn’t giving me exactly what I want, which is a single value : the earliest DATENUM value of the 10 values selected. I had thought that a subquery would be the answer:
SELECT TOP(1) DATENUM FROM ExperimentalEvents
WHERE DATENUM IN
(SELECT TOP(10) DATENUM FROM ExperimentalEvents
WHERE ACTOR= 15 AND IS_CHECKPOINT = 1
ORDER BY DATENUM DESC)
ORDER BY DATENUM ASC
This is a very inefficient addition to your original query, because the outer query requires another access to the table, and a very inefficient one as well. The solution is way way simpler, see below:
SELECT TOP(1) DATENUM FROM ( SELECT TOP(10) DATENUM FROM ExperimentalEvents WHERE ACTOR= 15 AND IS_CHECKPOINT = 1 ORDER BY DATENUM DESC ) AS X ORDER BY DATENUM ASC
Gert-Jan
- Marked As Answer by BradStiritz Friday, February 08, 2013 3:19 AM
-
Friday, February 08, 2013 3:32 AM
Hi Gert-Jan,
Thanks for your solution code! I did originally see the simple SELECT FROM (SELECT ..) as a natural potential solution, but I just couldn't get it to work.
So your secret ingredient is giving the subquery a dummy name (you called it 'X'), which is never used. I wonder what the magic is? If you or s/o else might comment please on exactly why the dummy 'AS' clause makes things work, I'd really appreciate it.
Thanks again for helping me out with this!
Brad
-
Friday, February 08, 2013 3:38 AMModerator
Brad,
But did you test this version against ROW_NUMBER() query version? Which performed better?
For every expert, there is an equal and opposite expert. - Becker's Law
My blog -
Friday, February 08, 2013 4:43 PM
Hi Naomi,
I'm sorry, I haven't investigated your ROW_NUMBER() suggestion yet. I didn't immediately see how to adapt my code to use ROW_NUMBER(), whereas Gert-Jan posted a complete code solution which I was able to test right away. His solution runs in the same 2 ms time as the subquery by itself (for small N anyway), so it has to be very close to an optimal solution. I don't expect to ever run this for large N.
I am interested in understanding ROW_NUMBER(), so I will try to take a look in the next week & post back here..
Thanks again for posting your suggestion
Brad
-
Friday, February 08, 2013 4:52 PMModerator
I just posted these links if you want to understand this method better:
Optimizing TOP N per Group Queries - blog by Itzik Ben-Gan explaining various optimization ideas
Including an Aggregated Column's Related Values - Erik and mine blog presenting several solutions of the problem with explanations for each
Including an Aggregated Column's Related Values - Part 2 - my blog post with use cases for the previous blogI also think that this solution was already posted in this thread by Perez.
For every expert, there is an equal and opposite expert. - Becker's Law
My blog -
Friday, February 08, 2013 5:17 PMThanks for the links, Naomi, I'll take a look. Regarding Perez' solution, it appears to me not to produce the desired results, as I explained to Perez..
Brad
-
Friday, February 08, 2013 5:59 PM
The query between braces is a derived table. Aliasing it (" AS X") is mandatory when you select from it. So just as you might write:
SELECT a.my_column FROM my_table AS a
the derived table will replace the actual table, like this:
SELECT a.my_column FROM ( SELECT ... ) AS a
You are free to choose whatever alias name you want (in the example above, it is "a"). Also, the keyword AS is optional. The difference is specifying an alias is not optional.
FYI, the ROW_NUMBER() approach that Naomi refers to would look something like this:
;WITH numbered_rows AS ( SELECT DATENUM , ROW_NUMBER() OVER (ORDER BY DATENUM DESC) AS rn FROM ExperimentalEvents WHERE ACTOR = 15 AND IS_CHECKPOINT = 1 ) SELECT DATENUM FROM numbered_rows WHERE rn = 10 -- replace with 1 for the most recentI expect this approach to work very efficiently as well. It will probably tie (from a performance point of view) with the solution I posted earlier.Gert-Jan
-
Friday, February 08, 2013 6:46 PMModeratorI do not see your explanation. I meant his second query which is just slightly different than the first and using Nth for selection.
For every expert, there is an equal and opposite expert. - Becker's Law
My blog -
Friday, February 08, 2013 6:54 PM
Thanks for your follow-up comments, Gert-Jan, they're very appreciated.
Best wishes,
Brad
-
Friday, February 08, 2013 6:56 PMThe second solution Perez posted requires DECLARE TABLE. As I mentioned in my original post, this appears to increase the parse & compile time dramatically, which disqualifies it from consideration..
Brad
-
Friday, February 08, 2013 6:59 PMModerator
Brad,
Sorry, I have to laugh. The declare table part is not relevant to the solution, the solution begins from creating a CTE (;with statement).
Declare table part is used to replicate your environment, just to demonstrate the solution. You need to change references to @t to your own table and you don't need this declare part at all for the actual solution.
---------
It was a time I added a comment - Solution begins from here just to avoid such misunderstanding from people taking everything too literally.
For every expert, there is an equal and opposite expert. - Becker's Law
My blog

