SQL Server Developer Center >
SQL Server Forums
>
Transact-SQL
>
GROUP BY does not use index effectively without isolating specific row.
GROUP BY does not use index effectively without isolating specific row.
- Hi,Given the table below, containing an integer field (SourceSystemKey) and an integer field (AsAtDateKey).I have created an index as follows:create index IX on SOMETABLE (SourceSystemKey, AsAtDateKey).When I run the following query:DECLARE @p0 Int SET @p0 = 1SELECT MAX([t0].[AsAtDateKey]) AS [value], SourceSystemKeyFROM [SOMETABLE] AS [t0](NOLOCK)GROUP BY SourceSystemKeyThe query takes a really long time, and I can see from the execution plan that all 600 million rows are scanned to get this result.When I add (WHERE SourceSystemKey = 1) I get only one row returned, and the query is really fast. I can actually call the query separately ten times with each SourceSystemKey and get the result faster than the database group by.There are only about ten distinct SourceSystemKey entries in the table, so why would so many rows be returned by the scan.I can see the huge pipe of 600 million rows entering a stream aggregate. If I narrow it to one SourceSystem the query is a few milliseconds.How do I structure the index in such a way that the max(date) for each source system can be located really quickly. As I mention there are only ten such entries, so only ten max dates.ThanksRegardsCraig.CREATE TABLE [dbo].[SOMETABLE]([BKClientAccountBalanceKey] [int] IDENTITY(1,1) NOT NULL,[WMFundKey] [int] NOT NULL,[AsAtDateKey] [int] NOT NULL,[UpdateAuditKey] [int] NOT NULL,[SourceSystemKey] [int] NULL,[AccountServiceFeeExcludingVat] [float] NULL,[AccountServiceFeeIncludingVat] [float] NULL,CONSTRAINT [PK_fctWMAUA] PRIMARY KEY NONCLUSTERED([BKClientAccountBalanceKey] ASC)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 70) ON [PRIMARY]) ON [PRIMARY]
- Moved byTom PhillipsModeratorTuesday, November 03, 2009 1:31 PMTSQL Question (From:SQL Server Database Engine)
Answers
- 1) Make a new clustered index over { SourceSystemKey ASC, AsAtDateKey ASC }
2) Change the Fill Factor from 70% to 90%
Now try again. Making the new clustered index can take a lot of time on 652 million records.
Also, here are some other approaches to speed up your query.
3) Have you considered partitioning? Both partitioned views and partitioned tables.
4) If this is a query that is run very often, consider writing an after trigger to update an auxiliary table that holds two columns { SourceSystemKey, AsAtDateKey }. An insert will take a few milliseconds more, but the final query "Query 1" will run in milliseconds, not minutes (or hours).- Marked As Answer byZongqing LiMSFT, ModeratorMonday, November 09, 2009 7:03 AM
All Replies
- When you don't add the WHERE clause, then yes, it does have to scan the entire table in order to do the counts. However, I suspect it's not really scanning the table (i.e. clustered index) itself, but rather scanning the index you created on (SourceSystemKey, AsAtDateKey), and, since that index is sorted by SourceSystemKey, at least the optimizer did not need to SORT the data in order to do the MAX()... that would have been worse.
When you add the WHERE clause, the optimizer can do a SEEK into the index you created for that one SourceSystemKey.
Have you actually timed the 10 separate queries with WHERE clauses for the 10 different SourceSystemKeys vs the SCAN of the entire table? If you count up the number of rows that each of those 10 separate queries had to process, you'll still end up with 600 million rows.
--Brad (My Blog) - Hi,Brad, thanks very much for the response, but I honestly believe something else is happening here.These two queries cannot be so drastically different. It doesn't make sense.
Here are the client statistics for the two versions.1) (without where, single group query)The query plan uses an "Index Seek" into a "Stream Aggregate"The query takes 110859ms to complete.The "actual number of rows" in the Index Seek is 652 million.logical reads 1827978, physical 1113, scan count 98, read aheads 18064592) Seek (with where)Using a union query for each group by, with the" where" set for each of the ten. A rediculous query, but one I am likely to use.The query plan uses "Index Seeks" into "Merge Joins"Each seek returns 1 row, "actual number of rows" is 1.logical reads 36, physical reads 0 (ZERO), scan count 9, read aheads 0 (ZERO).The entire query takes 31ms to complete, this compares to 110859 above. Extreme exponential difference.QUERY 1SELECT MAX([t0].[AsAtDateKey]) AS [Date], [t0].[SourceSystemKey] AS [SourceSystem]FROM [fctWMAUA] (NOLOCK) AS [t0]WHERE SourceSystemKey in (1,2,3,4,5,6,7,8,9)GROUP BY [t0].[SourceSystemKey]QUERY 2SELECT MAX([t0].[AsAtDateKey]) AS [Date], [t0].[SourceSystemKey] AS [SourceSystem]FROM [fctWMAUA] (NOLOCK) AS [t0]WHERE SourceSystemKey = 1GROUP BY [t0].[SourceSystemKey]UNIONSELECT MAX([t0].[AsAtDateKey]) AS [Date], [t0].[SourceSystemKey] AS [SourceSystem]FROM [fctWMAUA] (NOLOCK) AS [t0]WHERE SourceSystemKey = 2GROUP BY [t0].[SourceSystemKey]UNIONSELECT MAX([t0].[AsAtDateKey]) AS [Date], [t0].[SourceSystemKey] AS [SourceSystem]FROM [fctWMAUA] (NOLOCK) AS [t0]WHERE SourceSystemKey = 3GROUP BY [t0].[SourceSystemKey]UNIONSELECT MAX([t0].[AsAtDateKey]) AS [Date], [t0].[SourceSystemKey] AS [SourceSystem]FROM [fctWMAUA] (NOLOCK) AS [t0]WHERE SourceSystemKey = 4GROUP BY [t0].[SourceSystemKey]UNIONSELECT MAX([t0].[AsAtDateKey]) AS [Date], [t0].[SourceSystemKey] AS [SourceSystem]FROM [fctWMAUA] (NOLOCK) AS [t0]WHERE SourceSystemKey = 5GROUP BY [t0].[SourceSystemKey]UNIONSELECT MAX([t0].[AsAtDateKey]) AS [Date], [t0].[SourceSystemKey] AS [SourceSystem]FROM [fctWMAUA] (NOLOCK) AS [t0]WHERE SourceSystemKey = 6GROUP BY [t0].[SourceSystemKey]UNIONSELECT MAX([t0].[AsAtDateKey]) AS [Date], [t0].[SourceSystemKey] AS [SourceSystem]FROM [fctWMAUA] (NOLOCK) AS [t0]WHERE SourceSystemKey = 7GROUP BY [t0].[SourceSystemKey]UNIONSELECT MAX([t0].[AsAtDateKey]) AS [Date], [t0].[SourceSystemKey] AS [SourceSystem]FROM [fctWMAUA] (NOLOCK) AS [t0]WHERE SourceSystemKey = 8GROUP BY [t0].[SourceSystemKey]UNIONSELECT MAX([t0].[AsAtDateKey]) AS [Date], [t0].[SourceSystemKey] AS [SourceSystem]FROM [fctWMAUA] (NOLOCK) AS [t0]WHERE SourceSystemKey = 9GROUP BY [t0].[SourceSystemKey]They are both definitely seeking the same index in the query plan. I tripple checked.There just has to be a better way. - Craig,
If you use UNION ALL you could get better response, SS will not need to sort to find duplicate.
Can you post the execution plan for the first query?
I would expect and index scan on the covering index (nonclustered index mentioned in previous post).
AMB - 1) Make a new clustered index over { SourceSystemKey ASC, AsAtDateKey ASC }
2) Change the Fill Factor from 70% to 90%
Now try again. Making the new clustered index can take a lot of time on 652 million records.
Also, here are some other approaches to speed up your query.
3) Have you considered partitioning? Both partitioned views and partitioned tables.
4) If this is a query that is run very often, consider writing an after trigger to update an auxiliary table that holds two columns { SourceSystemKey, AsAtDateKey }. An insert will take a few milliseconds more, but the final query "Query 1" will run in milliseconds, not minutes (or hours).- Marked As Answer byZongqing LiMSFT, ModeratorMonday, November 09, 2009 7:03 AM


