locked
To GUID or not to GUID (for a clustered index)... RRS feed

  • Question

  • Please bear with me:

    We are writing a process in which our web application populates tables based on user input.  These tables are assigned sequential GUIDs through the application.  The GUIDS become primary keys in one table, and foreign keys in a different table.  We use this relationship, so we can push data from both tables into a separate table near the end of the process.  The whole process happens with the click of one button, so the data needs to move quickly and efficiently.  We don't forsee there being more than a thousand rows in these two tables at any one time (usually a lot less).

    The developer writing this and our development manager are debating over the best way to do it, and I'm stuck in the middle.

    Developer says:  If we use the sequential GUIDS as the clustered index, it will be fast.  Plus, since they are sequential, there won't be the fragmentation that you get with normal GUIDs on an index.

    Manager says: Every other clustered index in our DB is an integer identity column (which works well for what we are doing).  We should stick with that design for consistency, but add a column for the sequential GUID, which will be the Primary Key, a non-clustered index, and unique; unless the developer's suggestion has a really big performance boost.

    I say: The only way to truly know is to generate a bunch of test data, and try it both ways.

    I know that using a normal GUID will fragment your clustered index, but I've heard from some pretty advanced SQL guys that using a sequential GUID for the clutered index is also bad (although I can't remember why).

    Just to clarify: These guids are not NEWID() or NEWSEQUENTIALID().  they are sequential GUIDS created by the web application.

    Does anyone have any general suggestions?  We're using SQL server 2005 SP2 32 bit on a virtualized Windows Server 2003 box, in case it matters to anyone...

    Thanks in advance!


    Clint
    • Edited by NashVegas Tuesday, March 9, 2010 11:05 PM left out something
    Tuesday, March 9, 2010 10:47 PM

Answers

  • I know what you mean, but the clustered index has little to do with meaning.  It is all about performance. If you do a very high amount of your row fetches by a certain path, having the clustered index on that path is likely going to be best. 

    On the other hand, if you can't pinpoint a perfect place for the clustered index based on usage, having it on a very small keyed value like a unique integer value is ideal, due to the fact that all other indexes use the value as the row locator key, which in turn is used to do lookups in the clustred index for every usage of a non-clustered index.

    In the end, using the surrogate key for the clustered index key is the "easy" win.  It isn't 100% at all, but it does serve a very large percentage of the cases in the real world.
    Louis

    Thursday, March 18, 2010 4:44 AM
  • I certainly agree with your manager.If you have some standards for your all applications, then first of all we should follow them rather implementing something which has N number of questions. I think that your manager has made a Dev standards after doing lot of investigation and analysis and in other case if we create PK on GUID column with clustered, then Yes we have to pay price for the fragmentation as they are randomly generated and we do not know which page it is going to reside.

    I had this experience in one of my previous company and later we decided to go with NON cluster key on GUID column.

    To test this -
    1. Create two simple tables, one having Clustered index on Identity column and another having Clustered index on GUID column.
    2. Take the index statistics by DBCC SHOWCONTIG (OR) DMVs
    3. Try to insert 1000 records into each table
    4. Again Take the index statistics by DBCC SHOWCONTIG (OR) DMVs

    Now you can compare, which has got fragmented more and can decide further.

    And related to PK and FK relation, you can maintain same for both the cases.
    Monday, March 15, 2010 7:22 AM

All replies

  • GUIDs, sequential or otherwise, do not make good PKs, especially when they're going to be FKs as well. This is because they are 16 bytes, rather than 1, 2, 4, or 8 for their tinyint, smallint, int, and bigint alternatives. I use the smallest practical integer-type identity PKs as a clustered index, especially when it will be used as an FK. Size is important for performance because the more narrow the table the more records can be returned per page read.

    Your app logic seems a bit strange. Why is the web app assigning a guid to a set of parent data? Instead, why not save the data to a parent table, return the SQL-generated identity integer-type PK to the web app, and use it as a FK in the table that contains the child data? That way, you can get rid of the guid entirely.

    FWIW, I don't think performance should be a big problem with 1000 record tables. Long story short, I go with your manager's proposal if you can't implement the standard approach in the second paragraph.
    Wednesday, March 10, 2010 1:50 AM
  • GUIDs are clearly less effective than something smaller like an integer, but it really isn't that horrible. The way B-Tree indexes are formed it isn't that horrible.  Clearly your answer is best: "Try it"

    What I don't quite get is the mixed answer. It seems to me that the reason for the second table having the guid is to have the same access form as the other table, but if you use an identity for the primary key, it seems very unwieldy to put them into both.  Unless you need the link between the two tables, then why not have the GUID for the "staging" table, and then the identity key for the "permanent" table.

    Sequential GUIDs will be far better than non-sequential (and be sure that the binary value is sequential, not just the text value), but you should still expect to have to do more maintenance of the index due to the index pages filling up 4 times as fast as with an integer.

    Agree with the other poster though, 1000 rows is completely inconsequential and unless you are doing something amazing, it would probably run on my low power pentium tiny laptop I am posting this on.


    Louis

    Wednesday, March 10, 2010 4:50 AM
  • But is a GUID PK better then a composite key PK for a big table, let's say 1 billion rows ? Theoretically the composite key PK should be better as size and because is defined an order even the GUID can be sequential, what can be the diff in perf for these 2 ?

    Dan Vilnoiu
    Sunday, March 14, 2010 9:49 AM
  • That is a good question, and I was answering the question of GUID as a surrogate key. I always prefer a surrogate key over a natural key for a primary  (composite or not), simply because it tends to make the implementation more clean.  Clustering on a sequential value of any sort is better just for managment of fragmentation if nothing else. So this topic is more about what to cluster on than what to choose as a PK (or at least it should be :).  Every table ought to have at least one "natural" key that is not a system generated value.  System generated values are not sufficient for the only key since while they might be guaranteed unique, the rest of the columns of the table would not be guaranteed to be.

    But, having said that, the clustering key shouldn't be automatically given to the surrogate... If you are always doing singleton fetches by primary key, then it is almost always best.  But as an example, in an invoice/invoice line item situation, it might be better to cluster the invoice on the surrogate key, but then cluster the invoice line item on the composite key of the migrated invoice primary key and the invoice line item number (or whatever makes the row unique along with the invoice key.

    THis could be because you primarily get the invoice items by the invoice primary key. If you do that a significant number of times saving the bookmark lookups can really save a lot. Updates via the pk would incur the bookmark lookup, but it would be a single row and a single probe into the clustered index.

    So whereas the question of primary key is a personal preference, the clustered index choice is a performance question, and while the primary key being clustered is *usually* the best answer...not always.
    Louis

    Sunday, March 14, 2010 7:01 PM
  • I certainly agree with your manager.If you have some standards for your all applications, then first of all we should follow them rather implementing something which has N number of questions. I think that your manager has made a Dev standards after doing lot of investigation and analysis and in other case if we create PK on GUID column with clustered, then Yes we have to pay price for the fragmentation as they are randomly generated and we do not know which page it is going to reside.

    I had this experience in one of my previous company and later we decided to go with NON cluster key on GUID column.

    To test this -
    1. Create two simple tables, one having Clustered index on Identity column and another having Clustered index on GUID column.
    2. Take the index statistics by DBCC SHOWCONTIG (OR) DMVs
    3. Try to insert 1000 records into each table
    4. Again Take the index statistics by DBCC SHOWCONTIG (OR) DMVs

    Now you can compare, which has got fragmented more and can decide further.

    And related to PK and FK relation, you can maintain same for both the cases.
    Monday, March 15, 2010 7:22 AM
  • That is a good question, and I was answering the question of GUID as a surrogate key. I always prefer a surrogate key over a natural key for a primary  (composite or not), simply because it tends to make the implementation more clean.  Clustering on a sequential value of any sort is better just for managment of fragmentation if nothing else. So this topic is more about what to cluster on than what to choose as a PK (or at least it should be :).  Every table ought to have at least one "natural" key that is not a system generated value.  System generated values are not sufficient for the only key since while they might be guaranteed unique, the rest of the columns of the table would not be guaranteed to be.

    But, having said that, the clustering key shouldn't be automatically given to the surrogate... If you are always doing singleton fetches by primary key, then it is almost always best.  But as an example, in an invoice/invoice line item situation, it might be better to cluster the invoice on the surrogate key, but then cluster the invoice line item on the composite key of the migrated invoice primary key and the invoice line item number (or whatever makes the row unique along with the invoice key.

    THis could be because you primarily get the invoice items by the invoice primary key. If you do that a significant number of times saving the bookmark lookups can really save a lot. Updates via the pk would incur the bookmark lookup, but it would be a single row and a single probe into the clustered index.

    So whereas the question of primary key is a personal preference, the clustered index choice is a performance question, and while the primary key being clustered is *usually* the best answer...not always.
    Louis

    Usually, I use the identity column as the primary key, but not the clustered index. I think it's usefull and efficient to make one column clustered when you need rows returned ordered by that column for some business reason, while identity column usually  doesn't have any practical meaning.
    Wednesday, March 17, 2010 11:26 AM
  • I know what you mean, but the clustered index has little to do with meaning.  It is all about performance. If you do a very high amount of your row fetches by a certain path, having the clustered index on that path is likely going to be best. 

    On the other hand, if you can't pinpoint a perfect place for the clustered index based on usage, having it on a very small keyed value like a unique integer value is ideal, due to the fact that all other indexes use the value as the row locator key, which in turn is used to do lookups in the clustred index for every usage of a non-clustered index.

    In the end, using the surrogate key for the clustered index key is the "easy" win.  It isn't 100% at all, but it does serve a very large percentage of the cases in the real world.
    Louis

    Thursday, March 18, 2010 4:44 AM
  • Ooops misclicked Propose Answer! Sorry.

     

    I empathise with Nash entirely.

    We develop a large enterprise-sized medical application. We use guids as PKs generated by the app for all meta-data, so that we can write SQL scripts to initialise the system with data in such tables. We have over 1500 meta-data tables to define how the application hangs together, along with serialized screen definitions and business rule engine configuration sets

    We use Guids to internally PK/FK relate all user-entered data too. This makes it easy for our support analysts so that if say there is a problem with some data or meta-fields we can just do an extract and insert the raw data into a fresh database knowing we arent going to have to massage that data to get it to insert in another DB-ie we can easily mirror subsets of sites data to a different environment. Sure lookups arent as fast, but then again we arent shifting 1000s of tuples per transaction: normally CRUDing up to 100 max.


    This isnt an automated signature, I just like typing.
    Thursday, May 13, 2010 7:35 PM