SQL Server Developer Center > SQL Server Forums > Transact-SQL > GROUP BY does not use index effectively without isolating specific row.
Ask a questionAsk a question
 

AnswerGROUP BY does not use index effectively without isolating specific row.

  • Tuesday, November 03, 2009 12:59 PMCraig Main Users MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     
    Hi,

    Given the table below, containing an integer field (SourceSystemKey) and an integer field (AsAtDateKey).
    I have created an index as follows:

    create index IX on SOMETABLE (SourceSystemKey, AsAtDateKey).

    When I run the following query:
    DECLARE @p0 Int SET @p0 = 1
    SELECT MAX([t0].[AsAtDateKey]) AS [value], SourceSystemKey
    FROM [SOMETABLE] AS [t0](NOLOCK)
    GROUP BY SourceSystemKey

    The query takes a really long time, and I can see from the execution plan that all 600 million rows are scanned to get this result.
    When I add (WHERE SourceSystemKey = 1) I get only one row returned, and the query is really fast. I can actually call the query separately ten times with each SourceSystemKey and get the result faster than the database group by.

    There are only about ten distinct SourceSystemKey entries in the table, so why would so many rows be returned by the scan.
    I can see the huge pipe of 600 million rows entering a stream aggregate. If I narrow it to one SourceSystem the query is a few milliseconds.

    How do I structure the index in such a way that the max(date) for each source system can be located really quickly. As I mention there are only ten such entries, so only ten max dates.

    Thanks 
    Regards
    Craig.


    CREATE TABLE [dbo].[SOMETABLE](
    [BKClientAccountBalanceKey] [int] IDENTITY(1,1) NOT NULL,
    [WMFundKey] [int] NOT NULL,
    [AsAtDateKey] [int] NOT NULL,
    [UpdateAuditKey] [int] NOT NULL,
    [SourceSystemKey] [int] NULL,
    [AccountServiceFeeExcludingVat] [float] NULL,
    [AccountServiceFeeIncludingVat] [float] NULL,
     CONSTRAINT [PK_fctWMAUA] PRIMARY KEY NONCLUSTERED 
    (
    [BKClientAccountBalanceKey] ASC
    )WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON, FILLFACTOR = 70) ON [PRIMARY]
    ) ON [PRIMARY]



    • Moved byTom PhillipsModeratorTuesday, November 03, 2009 1:31 PMTSQL Question (From:SQL Server Database Engine)
    •  

Answers

  • Wednesday, November 04, 2009 2:51 PMPesoMVPUsers MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     Answer
    1) Make a new clustered index over { SourceSystemKey ASC, AsAtDateKey ASC }
    2) Change the Fill Factor from 70% to 90%

    Now try again. Making the new clustered index can take a lot of time on 652 million records.

    Also, here are some other approaches to speed up your query.

    3) Have you considered partitioning? Both partitioned views and partitioned tables.
    4) If this is a query that is run very often, consider writing an after trigger to update an auxiliary table that holds two columns { SourceSystemKey, AsAtDateKey }. An insert will take a few milliseconds more, but the final query "Query 1" will run in milliseconds, not minutes (or hours).

All Replies

  • Tuesday, November 03, 2009 5:42 PMBrad_SchulzModeratorUsers MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     
    When you don't add the WHERE clause, then yes, it does have to scan the entire table in order to do the counts.  However, I suspect it's not really scanning the table (i.e. clustered index) itself, but rather scanning the index you created on (SourceSystemKey, AsAtDateKey), and, since that index is sorted by SourceSystemKey, at least the optimizer did not need to SORT the data in order to do the MAX()... that would have been worse.

    When you add the WHERE clause, the optimizer can do a SEEK into the index you created for that one SourceSystemKey.

    Have you actually timed the 10 separate queries with WHERE clauses for the 10 different SourceSystemKeys vs the SCAN of the entire table?  If you count up the number of rows that each of those 10 separate queries had to process, you'll still end up with 600 million rows.

    --Brad (My Blog)
  • Wednesday, November 04, 2009 4:58 AMCraig Main Users MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     
    Hi,

    Brad, thanks very much for the response, but I honestly believe something else is happening here.
    These two queries cannot be so drastically different. It doesn't make sense.

    Here are the client statistics for the two versions.
    1) (without where, single group query)

    The query plan uses an "Index Seek" into a "Stream Aggregate"
    The query takes 110859ms to complete.
    The "actual number of rows" in the Index Seek is 652 million.

    logical reads 1827978, physical 1113, scan count 98, read aheads 1806459

    2) Seek (with where)
    Using a union query for each group by, with the" where" set for each of the ten. A rediculous query, but one I am likely to use.

    The query plan uses "Index Seeks" into "Merge Joins"
    Each seek returns 1 row, "actual number of rows" is 1.

    logical reads 36, physical reads 0 (ZERO), scan count 9, read aheads 0 (ZERO).

    The entire query takes 31ms to complete, this compares to 110859 above. Extreme exponential difference.

    QUERY 1
    SELECT MAX([t0].[AsAtDateKey]) AS [Date], [t0].[SourceSystemKey] AS [SourceSystem]
    FROM [fctWMAUA] (NOLOCK) AS [t0]
    WHERE SourceSystemKey in (1,2,3,4,5,6,7,8,9)
    GROUP BY [t0].[SourceSystemKey]

    QUERY 2

    SELECT MAX([t0].[AsAtDateKey]) AS [Date], [t0].[SourceSystemKey] AS [SourceSystem]
    FROM [fctWMAUA] (NOLOCK) AS [t0]
    WHERE SourceSystemKey = 1
    GROUP BY [t0].[SourceSystemKey]
    UNION
    SELECT MAX([t0].[AsAtDateKey]) AS [Date], [t0].[SourceSystemKey] AS [SourceSystem]
    FROM [fctWMAUA] (NOLOCK) AS [t0]
    WHERE SourceSystemKey = 2
    GROUP BY [t0].[SourceSystemKey]
    UNION
    SELECT MAX([t0].[AsAtDateKey]) AS [Date], [t0].[SourceSystemKey] AS [SourceSystem]
    FROM [fctWMAUA] (NOLOCK) AS [t0]
    WHERE SourceSystemKey = 3
    GROUP BY [t0].[SourceSystemKey]
    UNION
    SELECT MAX([t0].[AsAtDateKey]) AS [Date], [t0].[SourceSystemKey] AS [SourceSystem]
    FROM [fctWMAUA] (NOLOCK) AS [t0]
    WHERE SourceSystemKey = 4
    GROUP BY [t0].[SourceSystemKey]
    UNION
    SELECT MAX([t0].[AsAtDateKey]) AS [Date], [t0].[SourceSystemKey] AS [SourceSystem]
    FROM [fctWMAUA] (NOLOCK) AS [t0]
    WHERE SourceSystemKey = 5
    GROUP BY [t0].[SourceSystemKey]
    UNION
    SELECT MAX([t0].[AsAtDateKey]) AS [Date], [t0].[SourceSystemKey] AS [SourceSystem]
    FROM [fctWMAUA] (NOLOCK) AS [t0]
    WHERE SourceSystemKey = 6
    GROUP BY [t0].[SourceSystemKey]
    UNION
    SELECT MAX([t0].[AsAtDateKey]) AS [Date], [t0].[SourceSystemKey] AS [SourceSystem]
    FROM [fctWMAUA] (NOLOCK) AS [t0]
    WHERE SourceSystemKey = 7
    GROUP BY [t0].[SourceSystemKey]
    UNION
    SELECT MAX([t0].[AsAtDateKey]) AS [Date], [t0].[SourceSystemKey] AS [SourceSystem]
    FROM [fctWMAUA] (NOLOCK) AS [t0]
    WHERE SourceSystemKey = 8
    GROUP BY [t0].[SourceSystemKey]
    UNION
    SELECT MAX([t0].[AsAtDateKey]) AS [Date], [t0].[SourceSystemKey] AS [SourceSystem]
    FROM [fctWMAUA] (NOLOCK) AS [t0]
    WHERE SourceSystemKey = 9
    GROUP BY [t0].[SourceSystemKey]
    They are both definitely seeking the same index in the query plan. I tripple checked.
    There just has to be a better way.
  • Wednesday, November 04, 2009 1:57 PMHunchbackMVP, ModeratorUsers MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     
    Craig,

    If you use UNION ALL you could get better response, SS will not need to sort to find duplicate.

    Can you post the execution plan for the first query?

    I would expect and index scan on the covering index (nonclustered index mentioned in previous post).


    AMB

  • Wednesday, November 04, 2009 2:51 PMPesoMVPUsers MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     Answer
    1) Make a new clustered index over { SourceSystemKey ASC, AsAtDateKey ASC }
    2) Change the Fill Factor from 70% to 90%

    Now try again. Making the new clustered index can take a lot of time on 652 million records.

    Also, here are some other approaches to speed up your query.

    3) Have you considered partitioning? Both partitioned views and partitioned tables.
    4) If this is a query that is run very often, consider writing an after trigger to update an auxiliary table that holds two columns { SourceSystemKey, AsAtDateKey }. An insert will take a few milliseconds more, but the final query "Query 1" will run in milliseconds, not minutes (or hours).