none
Wide clustered index column order

    Question

  • Hi,

    I have what I imagine is a pretty common data warehouse scenario. There is a table containing items by business date and source system id. There are a number of source systems and the data is loaded and stored for each business date.

    Items from the same source system with the same name are allocated the same surrogate key so that an item can be tracked over business dates. So itemID is not a unique key.

    The table looks roughly like this: Item (Name varchar, ItemID int, SourceID int, BusinessDate datetime)

    The table is not currently partitioned, but it would make a lot of sense to partition it on business date.

    Most of the select queries will restrict or group on BusinessDate and SourceID

    I am assuming that the clustered index should reflect the logical tree structure of the data and so the column order should be:

    BusinessDate, SourceID, ItemID

    However, I have also read that wide index keys shold have the most selective colum first, which would be ItemID.

    Does anyone have any recommendations on the index column order and why ?

    Wednesday, July 24, 2013 7:56 AM

Answers

  • >However, I have also read that wide index keys should have the most selective column first

    This is a myth.  In addition to not usually being good advice, it's usually bad advice.  Putting the least selective column first will typically group logically related rows together, which often helps with insert performance and query performance.

    So it depends and rules-of-thumb are bad.  But that's just another rule-of-thumb.  Sometimes rules-of-thumb are good.  So here's a new one:

    In a data warehouse where you have a compound clustered index, partition by a non-leading column in the clustered index.

    So either cluster (SourceID, BusinessDate) and partition on BusinessDate, or cluster on (BusinessDate,SourceID) and partition on SourceID.  Physically you get to choose between having separate structures for each range of SourceID (perhaps each SourceID), or having separate structures for each date range.  There are tradeoffs.

    David


    David http://blogs.msdn.com/b/dbrowne/


    Wednesday, July 24, 2013 2:35 PM

All replies

  • CREATE CLUSTERED INDEX MyINDX on Table1
    (
       MostSelective,
       SecondMost,
       Least
    )

    first column eliminates more results than restricting on your second column first? It's easier if you picture how the index must be traversed across, column 1, then column 2, etc...you see that lopping off most of the results in the fist pass makes the 2nd step that much faster.

    Another case, if you queried on column 3, the optimizer wouldn't even use the index, because it's not helpful at all in narrowing down the result sets. Anytime you're in a query, narrowing down the number of results to deal with before the next step means better performance.

    Since the index is also stored this way, there's no backtracking across the index to find the first column when you're querying on it.


    Thanks and Regards, Prajesh Please use Marked as Answer if my post solved your problem and use Vote As Helpful if a post was useful.

    Wednesday, July 24, 2013 8:18 AM
  • I see the logic of this argument.

    But doesn't this mean that the "index" is in some sense "bigger" because there are more logical nodes required if the most selective column (ItemID) is first as the tree is now inverted. Under the root there are many ItemIDs and under each there are BusinessDate nodes and under those there are SourceSystemID nodes. The query has to seek through more keys to find the Item

    Or maybe the physical implementation doesn't actually work like that ...

    It isn't obvious, at least to me, which column order would work best.


    • Edited by Graham Hay Wednesday, July 24, 2013 10:19 AM Clarification
    Wednesday, July 24, 2013 8:56 AM
  • There is no one absolute and correct answer.  The best choice for a clustered index involves an analysis of the queries that use that table - and potentially evaluating the effect of different choices to determine which provides the best overall performance from many viewpoints.  Keep in mind that you need to consider update queries as much as you consider select queries.  You only have one clustered index and it is generally used to facilitate range-based queries.  Along those lines, it seems to me that date is as important a consideration as anything else and such a column is often the first in a clustered index.  See Tom's response to a similar question (which, to beat a dead horse, you should have searched for before posting since this is a common question).

    clustered key selection

    Wednesday, July 24, 2013 1:11 PM
  • Hallo Graham,

    several options have to be consideres with "multi-column-indexes" in general and clustered in special:

    1. How will data be inserted into the relation (that answer affects the degree of fragmentation of the clustered index!)

    2. How will you use the clustered keys as predicates in your queries?
    Concerning your example you have multiple options which makes the decision quite complex:

    SELECT ... FROM ...
    WHERE BusinessDate = x AND SourceId = y AND ItemId = z

    If this is the preferred method and you can confirm that ALWAYS BusinessDate is part of every query the suggestions made by others are optimal, BUT...

    SELECT ... FROM ...
    WHERE SourceId = y AND ItemId = z

    Concerning the clustered index is covering all three key attributes the above query will always use a clustered index scan because the most left predicate will not be considered in the query and the index is unusable! Think about YOUR search behaviour if you look for names in a phone book and you only have the first name. HOw do you have to search in the phone book? You have to search the whole book because you don't have a "step in" where you can start your search :)

    SELECT ... FROM ...
    WHERE BusinessDate = x AND SourceId = y

    This option will work pretty fine because the most left attributes can be used as SEEK predicates!

    SELECT ... FROM ...
    WHERE BusinessDate = x AND ItemId = z

    This option may have good results, too because the most left attribute can be used as SEEK predicate and can reduce the amount of data significantly depending on the cardinality / selectivity of it's values.

    Another - underestimated - concern is the amount of data which will be used in each other non clustered index because each index has to store the key attributes of the clustered index on top of the key attributes of the indexes itself!

    My recommendation:

    You should first analyse the "most used query habits" by checking the queries which are running against the fact table. Than you'll get a picture of the usage of the relation and it's used predicates.

    That result should drive your decision for the best clustered index :)


    MCM - SQL Server 2008
    MCSE - SQL Server 2012
    db Berater GmbH
    SQL Server Blog (german only)


    Wednesday, July 24, 2013 2:11 PM
  • >However, I have also read that wide index keys should have the most selective column first

    This is a myth.  In addition to not usually being good advice, it's usually bad advice.  Putting the least selective column first will typically group logically related rows together, which often helps with insert performance and query performance.

    So it depends and rules-of-thumb are bad.  But that's just another rule-of-thumb.  Sometimes rules-of-thumb are good.  So here's a new one:

    In a data warehouse where you have a compound clustered index, partition by a non-leading column in the clustered index.

    So either cluster (SourceID, BusinessDate) and partition on BusinessDate, or cluster on (BusinessDate,SourceID) and partition on SourceID.  Physically you get to choose between having separate structures for each range of SourceID (perhaps each SourceID), or having separate structures for each date range.  There are tradeoffs.

    David


    David http://blogs.msdn.com/b/dbrowne/


    Wednesday, July 24, 2013 2:35 PM