none
Optimizer Uses Index Scan instead of Seek for MAX on Primary Key Column

    問題

  • Hi

    I have a table that is partitioned by month on a datetime column, and hence this is the clustered index. There is an identity column that is the primary key, and of course a non-clustered index. When I check the plan for this

        SELECT MAX(Id) FROM MyTable

    it shows an index scan on the PK, and as the table has millions of rows it can take up to a minute to return.

    Shouldn't this be really quick, as MAX will always be the last row in the PK index?

    TIA

    Charles

    • 已移動 Tom PhillipsModerator 2012年3月6日 下午 10:21 possibly better answer from TSQL forum (From:SQL Server Database Engine)
    2012年3月6日 下午 04:27

解答

  • Hi Charles,

    I've just replied to Hugo that the query is selecting MAX of the identifier column, so that probably means that the sub-select will not help?

    It depends - it could help.

    Since the identifier column is an identity column, it will always
    increase (assuming you don't exxplicitly tamper with it). If (and
    that's a big if that only you can address) the values in the datetime
    column also always increase, for instance because you always use
    CURRENT_TIMESTAMP when inserting rows and never change the values in
    this column (or tamper with the system clock), then it would work,
    since the combination of these ingeredients ensures that the highest
    identifier value is always in the same row as the highest date value.

    On the other hand, given that the column has the identity property,
    have you considered using the IDENNT_CURRENT function?


    Hugo Kornelis, SQL Server MVP
    My SQL Server blog: http://sqlblog.com/blogs/hugo_kornelis


    Hugo Kornelis, SQL Server MVP
    My SQL Server blog: http://sqlblog.com/blogs/hugo_kornelis
    • 已標示為解答 ckl42 2012年3月7日 下午 01:15
    2012年3月7日 下午 12:14

所有回覆

  • are you saying the PK column(id) is a non clustered index

    vt


    Please mark answered if I've answered your question and vote for it as helpful to help other user's find a solution quicker

    2012年3月6日 下午 04:42
  • Because your query does not have a where clause.. hence it go for scan .if you change the code slightly and add where clause

    select max(id) from mytable where id>100

    this time I believe it would use seek..

    VT


    Please mark answered if I've answered your question and vote for it as helpful to help other user's find a solution quicker

    2012年3月6日 下午 04:51
  • Yes, that's right.

    Charles

    2012年3月6日 下午 05:19
  • Hi VT

    As it happens I saw that suggested in another post before I posted my question, so I tried it. Sure enough, it changed to a seek, but took even longer to execute.

    Perhaps then, it is not so much a matter of scan vs seek, as to why is it slow when all it has to do is get the last row from the non-clustered index?

    Charles

    2012年3月6日 下午 05:22
  • Not sure but the below link may be helpful:

    http://support.microsoft.com/kb/973255

    - Krishnakumar S

    2012年3月6日 下午 08:53
  • Did you also try

    select top (1) ID from myTable ORDER BY ID DESC?

    If the index on ID doesn't have DESC, then this query may be also slow.


    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog

    2012年3月6日 下午 10:45
  • Hi Krishnakumar

    I should have said that I am using 2008 R2 Enterprise. I think that the fix you linked to is incorporated, unless MS have re-introduced it :-(.

    Charles

    2012年3月7日 上午 12:39
  • Hi Naomi

    I did try that, yes. You are right that performance was worse.

    There is a curious thing: when I run the analyzer, it recommends a non-clustered index on the Id column, and declares that I will see a 99% increase in performance. But I already have a non-clustered index on this column, supporting the PK, so I fail to see why I don't already have that performance.

    In reality, the non-clustered index is on Id, datetimeColumn (it is a partitioned table, on datetimeColumn) but, nevertheless, the first column in the list is Id. I would try adding the recommended index as an experiment, but it tells me that it will occupy 5Gb, and I don't have that space for an experimental index.

    Charles

    2012年3月7日 上午 12:46
  • Can we change this index to use ID in DESC order? If you always look for the last ID, having DESC order in the Index may help.

    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog

    2012年3月7日 上午 01:06
  • I agree that it might help, but perhaps only in the same way as adding a new index on Id ASC, as the optimizer suggests. Changing the index from ASC to DESC would be a costly operation - on a live table - and unfortunately I don't have that luxury without requesting a maintenance window. I'd really like to understand why the existing index doesn't work efficiently, since ASC or DESC is only the difference between looking at one end or the other of the index.

    Charles

    2012年3月7日 上午 01:27
  • Are you able to get a backup of that database installed on another server (development) and test that change there?

    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog

    2012年3月7日 上午 01:30
  • Hi ckl42,

    The scan is expected here. A seek is used to find one specific value
    in an index, which can be the only value or the start of a range (in
    which case the seek will give the engine the starting point of what I
    call a partial scan - but that is not an official term).

    For a MIN() or MAX() query on an indexed column, you would usually get
    a SCAN operator, but you would also get a TOP operator in the plan,
    that ensures that the SCAN is executed only once, to read the first
    row of the index (or the last, as an index can be read in both
    directions).

    I am concenred about the eexecution times, though. Can you script the
    CREATE TABLE script for the table from SSMS (with the option to
    include indexes checked) and post the result here?


    Hugo Kornelis, SQL Server MVP
    My SQL Server blog: http://sqlblog.com/blogs/hugo_kornelis


    Hugo Kornelis, SQL Server MVP
    My SQL Server blog: http://sqlblog.com/blogs/hugo_kornelis
    2012年3月7日 上午 08:16
  • Unfortunately, it is very unlikely that we could do that because of the sheer size of the database. In addition, it is in a production environment that we do not have that level of control over.

    Charles

    2012年3月7日 上午 09:19
  • Hi Hugo

    Here is the script:

    CREATE TABLE [dbo].[MyTable](
     [MyTableId] [bigint] IDENTITY(1,1) NOT NULL,
     [MyTableTime] [datetime] NOT NULL,
     [MyIdentifier] [varchar](15) NOT NULL,
     [ValidRecord] [bit] NOT NULL,
     [CreatedDate] [datetime] NOT NULL,
     CONSTRAINT [PK_MyTable] PRIMARY KEY NONCLUSTERED
    (
     [MyTableId] ASC,
     [MyTableTime] ASC
    )WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [MonthPartitionScheme]([MyTableTime])
    ) ON [MonthPartitionScheme]([MyTableTime])
    GO
    SET ANSI_PADDING OFF
    GO
    CREATE CLUSTERED INDEX [IX_MyTable_MyTableTime] ON [dbo].[MyTable]
    (
     [MyTableTime] ASC
    )WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [MonthPartitionScheme]([MyTableTime])
    GO
    CREATE NONCLUSTERED INDEX [IX_MyTable_CreatedDate_MyIdentifier] ON [dbo].[MyTable]
    (
     [CreatedDate] ASC,
     [MyIdentifier] ASC
    )WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [MonthPartitionScheme]([MyTableTime])
    GO
    CREATE NONCLUSTERED INDEX [IX_MyTable_MyIdentifier_MyTableTime] ON [dbo].[MyTable]
    (
     [MyIdentifier] ASC,
     [MyTableTime] ASC
    )WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [MonthPartitionScheme]([MyTableTime])
    GO
    CREATE UNIQUE NONCLUSTERED INDEX [UQ_MyTable_MyTableTime_MyIdentifier_ValidRecord] ON [dbo].[MyTable]
    (
     [MyTableTime] ASC,
     [MyIdentifier] ASC,
     [ValidRecord] ASC
    )WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [MonthPartitionScheme]([MyTableTime])
    GO

    Charles

    2012年3月7日 上午 09:44
  • Hi ckl42,

    The scan is expected here.

    Indeed. I ran this little code (without an index):

    DECLARE @EXAMPLE TABLE (ID INT);
    
    INSERT INTO @EXAMPLE (ID) VALUES (0);
    INSERT INTO @EXAMPLE (ID) VALUES (1);
    INSERT INTO @EXAMPLE (ID) VALUES (2);
    
    SELECT MAX(ID) FROM @EXAMPLE;

    You say the server:
    1. Retrive every ID from the table
    2. Aggregate all the values you just retrieved to find the highest.

    If you take this code:

    DECLARE @EXAMPLE TABLE (ID INT);
    
    INSERT INTO @EXAMPLE (ID) VALUES (0);
    INSERT INTO @EXAMPLE (ID) VALUES (1);
    INSERT INTO @EXAMPLE (ID) VALUES (2);
    
    SELECT TOP 1 ID FROM @EXAMPLE ORDER BY ID DESC;

    You say the sever:
    1. Retrieve every ID from the table
    2. Order the result desceding and give me the highest.

    All a index really can do here, is to give you a smaller table to scan. But finding the biggest from N values, means actually comparing all N values (unless you have it saved someplace else and a bright enough to remember how to retreive it).

    When this is really only about finding the last inserted ID, how about making a seperate table with only one row + a Trigger on thsi table that set's this one row to the last inserted every time an value is inserted? I am not 100% certain how triggers works, but I guess they would still be in the implicit transaction of the insert.

    2012年3月7日 上午 10:30
  • Basically the stream aggreggate physical operator for MAX , MIN, AVG etc. expects a group of rows as input for aggreggate calculation and outputs a scalar aggreggate (if no grouping specified) or vector aggreggate (if grouping columns specified). If your table has 100,000 rows and you have an index on RowNumber column  then when you execute SELECT MAX(RowNumber) FROM YourTable it will go for an Index Scan and input the row(s) to TOP operator. The single output row from TOP is then supplied to Stream Aggregate operator. If you have a unique index then the TOP operator can be avoided since the rows will be unique. When you execute a query like SELECT MAX(RowNumber) FROM YourTable WHERE RowNumber > 1000 then there is no need for scan and query processor uses index seek to input the rows to TOP or Stream Aggregate based on the uniquness of the predicate or index column value.

    This is what I think happening for aggreggate operations.

    - Krishnakumar S

    2012年3月7日 上午 10:34
  • Hi Charles,

    Thanks for the script. I assume that the Id column from the original
    post is the MyIdentifier column, and that the index used in the query
    is this one:

    CREATE NONCLUSTERED INDEX [IX_MyTable_CreatedDate_MyIdentifier] ON [dbo].[MyTable]
    (
     [CreatedDate] ASC,
     [MyIdentifier] ASC
    )WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [MonthPartitionScheme]([MyTableTime])

    I see that you partitioned this index on the same partition scheme as
    the table. This means that, basically, for every partition you have,
    you have a seperate index for only the values in the appropriate
    range. And the highest MyIdentifier value could be in every one of
    those ranges.

    This gives the query optimizer a few options to choose from:

    1. Do an ordered, backwards scan, combined with a TOP operator (to
    fetch only a single row) for each of the partitions, then aggregate
    the results of all those partitions to find the final maximum value;

    2. Do a full scan of the table; feed the results to an aggregate
    operator.

    I don't know which option the optimizer picked for you, and I can't
    find out (I could create the table and load a few rows of data, but
    that would not help - choices like this are based on the amount of
    data, and there's no way I'll spend the time to load the table with
    millions of rows!)

    You could consider creating a query on the MyIdentifier column only,
    and creating that index on a filegroup instead of on the partition
    scheme. That should speed up this query. But since each extra index
    adds overhead, you should consider other options first. For instance,
    if you know that the highest MyIdentifier value is always bound to a
    recennt date (because of how values are generated), you could include a date filter in the query:

    SELECT MAX(MyIdentifier)
    FROM MyTable
    WHERE MyTableTime > '2012-06-03';

    That way, SQL Server knows that only the last partition has to be
    checked.

    You should also consider just accepting that this query is slow (which
    is obviously only viable if the query is used very infrequently).


    Hugo Kornelis, SQL Server MVP
    My SQL Server blog: http://sqlblog.com/blogs/hugo_kornelis


    Hugo Kornelis, SQL Server MVP
    My SQL Server blog: http://sqlblog.com/blogs/hugo_kornelis
    2012年3月7日 上午 11:35
  • Try this:

    SELECT MAX(MyIdentifier)
    FROM dbo.MyTable
    WHERE MyTableTime = ( SELECT MAX(MyTableTime) FROM dbo.MyTable )

    This is semantically different from simply SELECT MAX... so may not meet your requirements.  I guess it depends on how MyTableTime is created.  Is it correlated to MyIdentifier (in which case my solution work) or not (in which case it won't!).
    • 已編輯 wBob 2012年3月7日 上午 11:51 semantic difference
    2012年3月7日 上午 11:43
  • Hi Christopher84,

    All a index really can do here, is to give you a smaller table to scan. But finding the biggest from N values, means actually comparing all N values (unless you have it saved someplace else and a bright enough to remember how to retreive it).

    That is actually not correct. If there is more data in the table, the
    optimizer will find a plan that has to access two pages only (one
    internal page to find the first page with real data, and that page).

    Try this repro:

    CREATE TABLE MyTable
        (MyID int NOT NULL,
         Filler char(8000) NOT NULL
        );
    
    CREATE CLUSTERED INDEX MyIndex ON MyTable(MyID);
    
    DECLARE @i int = 1;
    WHILE @i < 100
    BEGIN;
        INSERT INTO MyTable VALUES (@i, '');
        SET @i += 1;
    END;
    
    SET STATISTICS IO ON;
    
    SELECT MyID FROM MyTable;             -- Reads 101 pages
    SELECT MIN(MyID) FROM MyTable;    -- Reads 2 pages
    SELECT MAX(MyID) FROM MyTable;    -- Reads 2 pages
    
    SET STATISTICS IO OFF;
    
    DROP TABLE MyTable;
    go

    If you run the three SELECT statements with display actual execution
    plan turned on, you'll see that they all involve a scan operator. But
    the second and third also include a top operator (which ensures that
    the scan doesn't process the entire table, but stops after a few rows.
    One, in this case). And if you check the properties of the scan
    operator in the 2 last plans, you'll also see that the scan direction
    is FORWARD in the first, and BACKWARD in the second. So the last one
    starts at the end, reads a single row, and is then stopped by the top
    operator (or, actually, the top operator simply just calls the scan
    operator once only).


    Hugo Kornelis, SQL Server MVP
    My SQL Server blog: http://sqlblog.com/blogs/hugo_kornelis


    Hugo Kornelis, SQL Server MVP
    My SQL Server blog: http://sqlblog.com/blogs/hugo_kornelis
    2012年3月7日 上午 11:48
  • Hi wBob,

    Try this:

    SELECT MAX(MyIdentifier)
    FROM dbo.MyTable
    WHERE MyTableTime = ( SELECT MAX(MyTableTime) FROM dbo.MyTable )
    

    That will probably be very fast - but will it also be correct? How do
    you know if the maximum MyIdenitfier value is actually associated with
    the maximum MyTableTime value?
    If it is, then the above query might indeed work.


    Hugo Kornelis, SQL Server MVP
    My SQL Server blog: http://sqlblog.com/blogs/hugo_kornelis


    Hugo Kornelis, SQL Server MVP
    My SQL Server blog: http://sqlblog.com/blogs/hugo_kornelis
    2012年3月7日 上午 11:51
  • Yes, realised and edited the post : )

    2012年3月7日 上午 11:58
  • Hi Hugo

    Sorry, no, the Id from the original is MyTableId, and the index the optimizer chooses is PK_MyTable (index scan). The query is this

        SELECT MAX(MyTableId)
        FROM MyTable

    I don't know how much this affects what you go on to say but I'm replying quickly whilst I read the rest.

    Cheers

    Charles

    2012年3月7日 下午 12:02
  • Hi Bob

    I've just replied to Hugo that the query is selecting MAX of the identifier column, so that probably means that the sub-select will not help?

    Charles

    2012年3月7日 下午 12:04
  • I don't know how much this affects what you go on to say but I'm replying quickly whilst I read the rest.
    I don't think it really affect what I say in my post. The same issue still exists - the index is partitioned by the date column; SQL Server doesn't know in which partition the highest values can be found, so it has to check them all.

    Hugo Kornelis, SQL Server MVP
    My SQL Server blog: http://sqlblog.com/blogs/hugo_kornelis

    2012年3月7日 下午 12:05
  • I think I'm starting to understand what's going on. That being the case, could it not look at the PK_MyTable index in each of the [15] current partitions and return the maximum MyTableId from those? It still seems like a quick task to take the last value from each and return the maximum, or am I still missing the point?

    Charles

    2012年3月7日 下午 12:13
  • Hi Charles,

    I've just replied to Hugo that the query is selecting MAX of the identifier column, so that probably means that the sub-select will not help?

    It depends - it could help.

    Since the identifier column is an identity column, it will always
    increase (assuming you don't exxplicitly tamper with it). If (and
    that's a big if that only you can address) the values in the datetime
    column also always increase, for instance because you always use
    CURRENT_TIMESTAMP when inserting rows and never change the values in
    this column (or tamper with the system clock), then it would work,
    since the combination of these ingeredients ensures that the highest
    identifier value is always in the same row as the highest date value.

    On the other hand, given that the column has the identity property,
    have you considered using the IDENNT_CURRENT function?


    Hugo Kornelis, SQL Server MVP
    My SQL Server blog: http://sqlblog.com/blogs/hugo_kornelis


    Hugo Kornelis, SQL Server MVP
    My SQL Server blog: http://sqlblog.com/blogs/hugo_kornelis
    • 已標示為解答 ckl42 2012年3月7日 下午 01:15
    2012年3月7日 下午 12:14
  • Hi Charles,

    I think I'm starting to understand what's going on. That being the case, could it not look at the PK_MyTable index in each of the [15] current partitions and return the maximum MyTableId from those? It still seems like a quick task to take the last value from each and return the maximum, or am I still missing the point?

    That's what I would expect to happen. I don't know why it doesn't, or,
    if it does, why it still takes so long. (But I must admit that I have
    little experience with partitioned tables).

    The problem is that I can't really troubleshoot this. Plans chosen by
    the optimizer depend on the amount of data and the distribution of the
    data in the statistics. So to test this behaviour here, I'd need the
    same amount of data you have, and with the same distribution. Just
    popping in a few sample rows will not help with this problem!

    That being said - have you already inspected the execution plan of the
    query? And have you tried if updating the statistics on your table
    changes the plan choice?


    Hugo Kornelis, SQL Server MVP
    My SQL Server blog: http://sqlblog.com/blogs/hugo_kornelis


    Hugo Kornelis, SQL Server MVP
    My SQL Server blog: http://sqlblog.com/blogs/hugo_kornelis
    2012年3月7日 下午 12:27
  • Brrrrrrilliant! I tried

        SELECT MAX(MyTableId)
        FROM MyTable
        WHERE MyTableId >= IDENT_CURRENT('MyTable')

    and it returns instantly, with the right value.

    Thanks a lot Hugo.

    Charles

    2012年3月7日 下午 01:09
  • I checked the partition fragmentation, and although most have very low fragmentation, the last one is heavily fragmented. I'm guessing that is the problem. I can see I will have to go for a maintenance window very soon to put that right, not least for this reason.

    Cheers

    Charles

    2012年3月7日 下午 01:12
  • Hmmm, that was not exactly what I had in mind, and I'm actually
    surprised that this helps - but the most important thing is that you
    got your problem solved!


    Hugo Kornelis, SQL Server MVP
    My SQL Server blog: http://sqlblog.com/blogs/hugo_kornelis


    Hugo Kornelis, SQL Server MVP
    My SQL Server blog: http://sqlblog.com/blogs/hugo_kornelis
    2012年3月7日 下午 01:13
  • Just out of interest, then, what were you intending?

    Charles

    2012年3月7日 下午 01:17
  • Isn't this the same as simply:

    SELECT IDENT_CURRENT('MyTable')
    

    2012年3月7日 下午 01:20
  • Er, yes, good spot. I got so wrapped up in the original query I lost sight of what I was trying to achieve.

    Charles

    2012年3月7日 下午 01:49
  • >Just out of interest, then, what were you intending?

    SELECT IDENT_CURRENT('MyTable');

    But I see wBob beat me to it. ;-)


    Hugo Kornelis, SQL Server MVP
    My SQL Server blog: http://sqlblog.com/blogs/hugo_kornelis


    Hugo Kornelis, SQL Server MVP
    My SQL Server blog: http://sqlblog.com/blogs/hugo_kornelis
    2012年3月7日 下午 01:58
  • Yes I was going to say I think this is what Hugo was intending so can't really take the credit on that one : )

    2012年3月7日 下午 02:24