none
Performance of Surrogate Key vs Composite Keys

    Question

  • I know there are tons of debates about this online.  I can't seem to find any good articles on performance of a composite key vs a integer surrogate key.  I'm still torn on should I use a 3 column composite key or use a surrogate key.  The composite key would be (varchar(25), varchar(25), varchar(10)).  The surrogate key would be smallint.  Is there a row count threshold to recommend a composite key over a surrogate key?  Does anyone have any performance tests on the two types of keys?

    Thanks,


    Ryan

    Friday, June 22, 2012 8:56 PM

Answers

All replies

  • You can also see it the other way around.

    For a composite key of 60 bytes, you can only store some 130 rows per page. With a smallint surrogate key, you can store some 4000 rows per page. Not including leaf nodes. That's an indicator the narrower key will perform better, because there are less pages to be read.


    N 56°04'39.26"
    E 12°55'05.63"

    Friday, June 22, 2012 9:06 PM
  • That depends a lot on how you will use the data. SELECT queries may be faster with the surrogate, but update operations may prove to be more complex.

    If these three columns are coming from the outside world, and may change, I would use a surrogate key. (And I would probably make it an int to avoid egg on my face where there are more than 32767 rows in the table.)

    If this a table that joins three other tables together, I would be less inclined to use a surrogate.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
    Friday, June 22, 2012 9:27 PM
  • This question was a subject for this T-SQL quiz

    http://beyondrelational.com/quiz/sqlserver/general/2010/questions/sqlserver-quiz-general-2010-madhu-k-nair-surrogate-key-vs-natural-key.aspx

    You can read through all answers or go directly to the very last page where I posted two late answers with very good links.


    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog

    • Proposed as answer by Syed Qazafi Anjum Saturday, June 23, 2012 10:32 AM
    • Marked as answer by Ryan0827 Tuesday, June 26, 2012 1:54 AM
    Friday, June 22, 2012 10:07 PM
    Moderator
  • use surrogate key for better performance.check below link for more info
    http://kresimirbojcic.com/2011/06/19/composite-vs-surrogate-key-a-battle-to-the-death.html

    SQL Champ
    Database Consultants NY

    Saturday, June 23, 2012 4:49 AM
  • I have chapre4rs in my books on key which you might want to read. First you maintain darta integrity then you worry about performance. T-SQL does not have surrogates. Doesn't anyone read any more? 


    A surrogate key is system generated to replace the actual key behind the covers where the user never sees it. It is based on attributes in the table. Example: Teradata hashing algorithms, indexes, pointer chains, ADABASE numbers, etc. 

    The fact that you can never see it or use it for DELETE and UPDATE or create it for INSERT is vital. When users can get to them, they will screw up the data integrity by getting the real keys and these physical locators out of synch. The system must maintain them. 

    Notice that people get "exposed physical locator" and surrogate mixed up; they are totally different concepts. 

    An appeal to authority, with a quote from Dr. Codd: "..Database users may cause the system to generate or delete a surrogate, but they have no control over its value, nor is its value ever displayed to them ..."(Dr. Codd in ACM TODS, pp 409-410) and Codd, E. (1979), Extending the database relational model to capture more meaning. ACM Transactions on Database Systems, 4(4). pp. 397-434. 

    This means that a surrogate ought to act like an index; created by the user, managed by the system and NEVER seen by a user. That means never used in queries, DRI or anything else that a user does. 

    Codd also wrote the following:

    "There are three difficulties in employing user-controlled keys as permanent surrogates for entities.

    (1) The actual values of user-controlled keys are determined by users and must therefore be subject to change by them (e.g. if two companies merge, the two employee databases might be combined with the result that some or all of the serial numbers might be changed.).

    (2) Two relations may have user-controlled keys defined on distinct domains (e.g. one uses social security, while the other uses employee serial numbers) and yet the entities denoted are the same.

    (3) It may be necessary to carry information about an entity either before it has been assigned a user-controlled key value or after it has ceased to have one (e.g. and applicant for a job and a retiree).

    These difficulties have the important consequence that an equi-join on common key values may not yield the same result as a join on common entities. A solution - proposed in part [4] and more fully in [14] - is to introduce entity domains which contain system-assigned surrogates. Database users may cause the system to generate or delete a surrogate, but they have no control over its value, nor is its value ever displayed to them....." (Codd in ACM TODS, pp 409-410).

    in the T-SQL world, back in the says when programmers did not know the RM, worked on slow 16-bit hardware and had lousy optimizers, they would use IDENTITY as an exposed physical locator or generate a fake pointer (I guess that is your SMALLINT). 

    All this does is waste disk since the real key still has to be verified, validated and indexed. What used to happen is that people did not bother with data integrity and built fake pointer chains to improve performance, and to keep a familiar mindset. 

    Today, none of that is true except that the time to read a byte from a disk is much greater than the time to compare 100 byres in main storage. 

    In your example of a composite key string (VARCHAR(25), VARCHAR(25), VARCHAR(10)) is unusual. Most industry standard encoding schemes use the required UNICODE Latin subset in all the languages and their lengths are fixed at 10 or shorter. I hope you are not like the guy 
    who posted a declaration of “country VARCHAR(MAX)” instead of using the three letter ISO country code. Talk about inviting garbage data! 

     

    --CELKO-- Books in Celko Series for Morgan-Kaufmann Publishing: Analytics and OLAP in SQL / Data and Databases: Concepts in Practice Data / Measurements and Standards in SQL SQL for Smarties / SQL Programming Style / SQL Puzzles and Answers / Thinking in Sets / Trees and Hierarchies in SQL

    Saturday, June 23, 2012 2:47 PM