locked
Which index is better? RRS feed

  • Question

  • All things being equal, which is a better indexing strategy?

    Scenario:

    Fact table, 100 million rows of monthly data.  Approx. 1.5 million new rows per month.  Non partitioned.  Users query most recent month, 3 months or 12 months the most, then additional filters after that.  Table contains five years of history (60 months).  Reading data is more important to users that data inserts, which happen only monthly, and then just overnight.

    Index strategy A:

    Unique Clustered index on an identity key - Primary key constraint; non clustered indexes on other filtering columns, including the Date column

    Index strategy B:

    Non unique Clustered index on date column; primary key constraint on identity column; non clustered indexes on other filtering columns, first excluding date column, then trying with non clustered index on date column

    ---------------

    Now, according to what I've "read", scenario B would be better performing for reads, whereas scenario A would be better performing for Inserts.  But when I set this up in a test environment, I am seeing the opposite effect in most cases.

    When set up as scenario A, a select query filtering on date (where date >= 201004) it does an index scan and the IO stats and query times look good.

    When set up as scenario B, with no non-clustered index on date, the same query does a clustered index scan but the IO stats and query times are worse than scenario A.

    When set up as scenario B, with a non-clustered index on date, the same query does an index scan and performs better than without the non-clustered index, however is still worse than scenario A.

    So, why do I see it recommended to put a clustered index on a date (or other "range"-type columns) for large fact tables? Or does this only hold true if the table is partitioned on that column?

    Thursday, December 2, 2010 10:06 PM

Answers

  • Sorry, you are right- they were seeks.  But that doesn't change my original question.

    I ran the queries again, clearing cache beforehand and making sure I had just reindexed and updated statistics fullscan to eliminate fragmentation issues, etc.  When I ran both in a batch in SSMS here was the outcome:


    For those queries the non-clustered index will be faster, because the index covers the query; no need to read the data pages. You can see this from the query plans. And the NC index is smaller than the clustered index.

    If you instead try a query like:

    SELECT MIN(somecol) FROM table_a WHERE DATE_KEY > 201005

    Where somecol is not part of the clustered index, you will get quite a different outcome.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
    Links for SQL Server Books Online: SQL 2008, SQL 2005 and  SQL 2000.
    (Just click the link you need.)
    • Marked as answer by KJian_ Wednesday, December 8, 2010 3:44 AM
    Friday, December 3, 2010 10:42 PM

All replies

  • When set up as scenario A, a select query filtering on date (where date >= 201004) it does an index scan and the IO stats and query times look good.

    When set up as scenario B, with no non-clustered index on date, the same query does a clustered index scan but the IO stats and query times are worse than scenario A.


    You should see index seeks, not scans. The fact that you see scans suggests that there is some kind of implicit conversion causing trouble.

    Given the fact that you get scans, it's no wonder that B is slower, since you now have scan the full table, instead of only the index.

    What is the actual data type of the column date?


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
    Links for SQL Server Books Online: SQL 2008, SQL 2005 and  SQL 2000.
    (Just click the link you need.)
    Thursday, December 2, 2010 10:58 PM
  • Sorry, you are right- they were seeks.  But that doesn't change my original question.

    I ran the queries again, clearing cache beforehand and making sure I had just reindexed and updated statistics fullscan to eliminate fragmentation issues, etc.  When I ran both in a batch in SSMS here was the outcome:

     

    Query 1:

     

    DBCC FREEPROCCACHE

    DBCC DROPCLEANBUFFERS

     

    SELECT COUNT(*) FROM TABLE_A WHERE DATE_KEY >= 201005 (integer)

    PK clustered on identity, unique/ Non-clustered IX on date_key

    6% of batch

    Index Seek -> Stream Agg -> Compute Scalar -> Select

    Scan count 1, logical reads 1975, physical reads 33, read-ahead reads 1968, CPU time = 93 ms,  elapsed time = 553 ms.

     

    Query 2 (run1 - no non-clustered indexes): 

     

    DBCC FREEPROCCACHE

    DBCC DROPCLEANBUFFERS

     

    SELECT COUNT(*) FROM TABLE_B WHERE DATE_KEY >= 201005 (integer)

    Clustered on date_key, non-unique; PK on identity; no non-clustered indexes

    94% of batch

    Clustered Index Seek -> Stream Agg -> Compute Scalar -> Select

    Scan count 1, logical reads 46148, physical reads 595, read-ahead reads 46143, CPU time = 78 ms,  elapsed time = 7586 ms.

     

    Query 2 (run2 - with non-clustered index on date_key): 

     

    DBCC FREEPROCCACHE

    DBCC DROPCLEANBUFFERS

     

    SELECT COUNT(*) FROM TABLE_B WHERE DATE_KEY >= 201005 (integer)

    Clustered on date_key, non-unique; PK on identity; non-clustered index on date_key

    Index Seek -> Stream Agg -> Compute Scalar -> Select

    Scan count 1, logical reads 934, physical reads 13, read-ahead reads 929, CPU time = 62 ms,  elapsed time = 189 ms.

     

    So, from this experiment, is it fair to say that if you are doing range-type queries against large tables, not only should you put a clustered index on the range key, but a non-clustered index as well?

    Friday, December 3, 2010 3:42 PM
  • Sorry, you are right- they were seeks.  But that doesn't change my original question.

    I ran the queries again, clearing cache beforehand and making sure I had just reindexed and updated statistics fullscan to eliminate fragmentation issues, etc.  When I ran both in a batch in SSMS here was the outcome:


    For those queries the non-clustered index will be faster, because the index covers the query; no need to read the data pages. You can see this from the query plans. And the NC index is smaller than the clustered index.

    If you instead try a query like:

    SELECT MIN(somecol) FROM table_a WHERE DATE_KEY > 201005

    Where somecol is not part of the clustered index, you will get quite a different outcome.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
    Links for SQL Server Books Online: SQL 2008, SQL 2005 and  SQL 2000.
    (Just click the link you need.)
    • Marked as answer by KJian_ Wednesday, December 8, 2010 3:44 AM
    Friday, December 3, 2010 10:42 PM