none
extremely slow group by on int, 750K recs, sql server 2012 RRS feed

  • Question

  • I'm having an issue with the below t-sql code and suggestions are welcome.  Everything without the final GROUP BY and SUM clauses (in the last six lines), runs in 15 seconds, producing 753,000 rows.  However the GROUP BY (on an int) and SUM cause it to stretch out to a query lasting two full days (48 hours), and not yet completed.  I assume there's some way of speeding it, but do I have to scrap this entire approach and switch to an entirely different method, even row by row, instead (that is, should I expect a group by on 750,000 rows to take that long, or has something gone wrong)?  I can attach the execution plans; however, for the version with "group by" I can only give the estimated execution plan, as it never finishes running.  Thanks.

    WITH cteHr AS ( SELECT rt.ResTradeId, rt.DollarVolPerTick, p.Dtm, p.Price, ROW_NUMBER() OVER(PARTITION BY rt.ResTradeId ORDER BY p.Dtm)
    AS RowNum FROM dbo.ResBase AS r JOIN dbo.ResTrade AS rt ON rt.ResId = r.ResId JOIN dbo.DailyBase AS d ON (d.ComId=rt.ComId) AND (d.Ddte BETWEEN rt.EntryDdte AND rt.ExitDdte) JOIN dbo.PriceBase AS p ON (d.ComId= p.ComId) AND (d.Ddte = p.Ddte) AND (p.ComId = rt.ComId) AND (p.Dtm BETWEEN rt.EntryDtm AND rt.ExitDtm) AND ( ( (DATEPART(minute, p.Dtme)) = (DATEPART(minute, CAST((DATEADD(hour, 6, rt.EntryDtm)) AS TIME(0)))) ) OR ( p.Dtm=rt.ExitDtm ) ) WHERE r.fRun = 1 ) , cteHrDevObs AS ( SELECT R.ResTradeId, ABS(R.Price-L.Price)*R.DollarVolPerTick AS HrPriceChg, CASE (DATEDIFF(minute, L.Dtm, R.Dtm) % 60) WHEN 0 THEN 1 ELSE (CAST((DATEDIFF(minute, L.Dtm, R.Dtm) % 60) AS FLOAT)/ 60) END AS DurationHrSlice FROM cteHr AS R JOIN cteHr AS L ON R.ResTradeId = L.ResTradeId AND R.RowNum = L.RowNum + 1 ) SELECT ResTradeId, SUM(HrPriceChg * DurationHrSlice) AS HrDev, SUM(DurationHrSlice) AS DurationHrs FROM cteHrDevObs GROUP BY ResTradeId;


    Wednesday, July 25, 2012 2:25 AM

Answers

  • Confused and confused... A CTE is just like macro which is expanded into the query. Which means that if you refer to the CTE multiple times in the query, the CTE will be computed multiple times. Other products are capable to materialise the CTE, but SQL Server is not.

    Thus, your approach of using an intermediate table was the correct solution. Particularly if you add a primary key to support the join.

    Note that if you are on SQL 2012, there is no need for the self-join over the row number, but you could use lag() instead to get a value from the previous row.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
    Wednesday, July 25, 2012 8:24 AM
  • ok, never mind, I got it down from 48 hrs plus to 18 seconds by using an intermediate table.  obviously sql server is getting seriously confused here.  anything to do (statistics are up to date so must be something else).

    TRUNCATE TABLE dbo.ztmpResHrDev;
    
    WITH cteHr AS 
    (
    	SELECT rt.ResTradeId, rt.DollarVolPerTick, p.Dtm, p.Price, 
    		ROW_NUMBER() OVER(PARTITION BY rt.ResTradeId ORDER BY p.Dtm) AS RowNum
    	FROM 
    		dbo.ResBase AS r JOIN
    		dbo.ResTrade AS rt ON rt.ResId = r.ResId JOIN
    		dbo.DailyBase AS d ON   
    			(d.ComId=rt.ComId) AND 
    			(d.Ddte BETWEEN rt.EntryDdte AND rt.ExitDdte) JOIN
    		dbo.PriceBase AS p ON 
    			(d.ComId= p.ComId) AND
    			(d.Ddte = p.Ddte) AND
    			(p.ComId = rt.ComId) AND
    			(p.Dtm BETWEEN rt.EntryDtm AND rt.ExitDtm) AND
    			(   
    				(	(DATEPART(minute, p.Dtme)) =
    					(DATEPART(minute, CAST((DATEADD(hour, 6, rt.EntryDtm)) AS TIME(0)))) 
    				)
    				OR
    				(	p.Dtm=rt.ExitDtm  				)
    			)
    	WHERE
    		r.fRun = 1 
    )
    	INSERT INTO dbo.ztmpResHrDev (ResTradeId, HrPriceChg, DurationHrSlice)
    	SELECT 
    		R.ResTradeId, 
    			ABS(R.Price-L.Price)*R.DollarVolPerTick AS 
    		HrPriceChg,
    			CASE (DATEDIFF(minute, L.Dtm, R.Dtm) % 60)
    				WHEN 0 THEN 1
    				ELSE (CAST((DATEDIFF(minute, L.Dtm, R.Dtm) % 60) AS FLOAT)/ 60)
    					--for last frac hour.
    			END AS 
    		DurationHrSlice  
    					     			
    	FROM 
    		cteHr AS R JOIN
    		cteHr AS L ON R.ResTradeId = L.ResTradeId AND R.RowNum = L.RowNum + 1;
    
    
    
    	SELECT 
    		ResTradeId, 
    			SUM(HrPriceChg * DurationHrSlice) AS
    		HrDev,
    			SUM(DurationHrSlice) AS 
    		DurationHrs  
    	FROM ztmpResHrDev
        GROUP BY ResTradeId;
    

    • Marked as answer by TechVsLife2 Wednesday, July 25, 2012 5:23 PM
    Wednesday, July 25, 2012 3:37 AM

All replies

  • ok, never mind, I got it down from 48 hrs plus to 18 seconds by using an intermediate table.  obviously sql server is getting seriously confused here.  anything to do (statistics are up to date so must be something else).

    TRUNCATE TABLE dbo.ztmpResHrDev;
    
    WITH cteHr AS 
    (
    	SELECT rt.ResTradeId, rt.DollarVolPerTick, p.Dtm, p.Price, 
    		ROW_NUMBER() OVER(PARTITION BY rt.ResTradeId ORDER BY p.Dtm) AS RowNum
    	FROM 
    		dbo.ResBase AS r JOIN
    		dbo.ResTrade AS rt ON rt.ResId = r.ResId JOIN
    		dbo.DailyBase AS d ON   
    			(d.ComId=rt.ComId) AND 
    			(d.Ddte BETWEEN rt.EntryDdte AND rt.ExitDdte) JOIN
    		dbo.PriceBase AS p ON 
    			(d.ComId= p.ComId) AND
    			(d.Ddte = p.Ddte) AND
    			(p.ComId = rt.ComId) AND
    			(p.Dtm BETWEEN rt.EntryDtm AND rt.ExitDtm) AND
    			(   
    				(	(DATEPART(minute, p.Dtme)) =
    					(DATEPART(minute, CAST((DATEADD(hour, 6, rt.EntryDtm)) AS TIME(0)))) 
    				)
    				OR
    				(	p.Dtm=rt.ExitDtm  				)
    			)
    	WHERE
    		r.fRun = 1 
    )
    	INSERT INTO dbo.ztmpResHrDev (ResTradeId, HrPriceChg, DurationHrSlice)
    	SELECT 
    		R.ResTradeId, 
    			ABS(R.Price-L.Price)*R.DollarVolPerTick AS 
    		HrPriceChg,
    			CASE (DATEDIFF(minute, L.Dtm, R.Dtm) % 60)
    				WHEN 0 THEN 1
    				ELSE (CAST((DATEDIFF(minute, L.Dtm, R.Dtm) % 60) AS FLOAT)/ 60)
    					--for last frac hour.
    			END AS 
    		DurationHrSlice  
    					     			
    	FROM 
    		cteHr AS R JOIN
    		cteHr AS L ON R.ResTradeId = L.ResTradeId AND R.RowNum = L.RowNum + 1;
    
    
    
    	SELECT 
    		ResTradeId, 
    			SUM(HrPriceChg * DurationHrSlice) AS
    		HrDev,
    			SUM(DurationHrSlice) AS 
    		DurationHrs  
    	FROM ztmpResHrDev
        GROUP BY ResTradeId;
    

    • Marked as answer by TechVsLife2 Wednesday, July 25, 2012 5:23 PM
    Wednesday, July 25, 2012 3:37 AM
  • Confused and confused... A CTE is just like macro which is expanded into the query. Which means that if you refer to the CTE multiple times in the query, the CTE will be computed multiple times. Other products are capable to materialise the CTE, but SQL Server is not.

    Thus, your approach of using an intermediate table was the correct solution. Particularly if you add a primary key to support the join.

    Note that if you are on SQL 2012, there is no need for the self-join over the row number, but you could use lag() instead to get a value from the previous row.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
    Wednesday, July 25, 2012 8:24 AM
  • Thanks Erland, I didn't know about the cte repetition without materialization.  In this case I was assuming the issue was not the cte expansion (since the cte itself takes a few seconds to run), but some mixup in how to do sorting needed for the groupby (the intermediate table sorts for sql server, precluding it from devising sinister query strategies).  But perhaps sql was doing something truly dreadful like running the whole cte over again for each int value (a cte of 750K recs times at least 32,000 int values=24 billion recs)--I don't want to think such things.  I took a quick look at the bad estimated plan with sum/groupby (ctes and no intermediate table) and sql server thought there was only ONE record involved.  estimated 1 rec vs actual 750,000 recs will probably not lead to rational query strategies. 

    The usefulness of intermediate tables should figure a lot more prominently in query optimization books; at least in my mind going from over 48 hours down to 18 seconds is a big improvement for a minor change (though unfortunately it means using sprocs rather than inline table functions, so less pretty and clean).  For others reading, I didn't put a primary key on the intermediate table ("ztmpResHrDev" in the code) since the relevant field in the groupby ("ResTradeId") is non-unique, but I put a clustered index on that field, and it did the trick.

    I'm not up to speed on the newest window functions yet, but I have Itzik Ben-Gan's excellent new book on them (by MS Press), and am starting to use them.


    • Edited by TechVsLife2 Wednesday, July 25, 2012 5:26 PM
    Wednesday, July 25, 2012 5:21 PM
  • Great to hear that you are reading Itzik's book. I only came around to order it the other day!

    You can replace L.Dtm with

    with LAG(p.Dtm, 1) OVER (PARTITION BY rt.ResTradeId ORDER p.Dtm

    Note that LAG returns NULL for the first row (unless you specify something else for the third argument). Currently you are dropping this row entirely since you are inner-joining the CTE:s to each other.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
    Wednesday, July 25, 2012 7:35 PM