locked
Advantages/Disadvantages of using an Identity column in a table as Primary Key RRS feed

  • Question

  • I am trying to find the advantages of having an Identity on a column versus incrementing a sequential number as the primary key manually.

    There is a valid reason why I am thinking to do the incrementing manually. So, i just want to know what am I going to miss out not using

    the Identity.

     

    Thanks in advance!

    Tuesday, September 13, 2011 10:22 PM

Answers

  • It's still not clear for me why do you want to keep ID = User_ID for the original/first version of the user.

    How do you reference the table - by unique ID or by pair of ID/User_ID?

    In the first case (referencing by ID), I'd something like that:

    create table dbo.Users
    (
    	UserID int not null,
    	ID int not null default identity(1,1)
    	...
    )


    and generate UserId with something like that: http://aboutsqlserver.com/2010/10/03/sunday-t-sql-tip-how-to-generate-pseudo-identity-values/

     

    You also need to think about CI/NCI in such case - it should be either CI on (UserId, ID) and unique NCI on (ID) or vise versa. Depends on your queries.

    If you reference table by the pair of values(User_Id, ID), I'd probably replace ID with the version and don't bother with identity at all. Something like that:

    create table dbo.Users
    (
    	UserID int not null,
    	VersionID int not null 
    		constraint DEF_Users_VersionId
    		default 0,
    
    	constraint PK_Users
    	primary key clustered(UserId, VersionId)
    )
    

    I'd generate UserID the same way as above and increment versionId for the new versions of the row starting with 0.

    Last, but not least, if you don't need to reference anything but last version of the user, I'd keep only the one version in this table and store old versions (history of the changes) in the different place


    Thank you!

    My blog: http://aboutsqlserver.com


    Thursday, September 15, 2011 12:27 AM

All replies

  • Well, let's say that manual implementation of the sequence would be (most likely) slower than identity. Benefit is that you can make the implementation that does not have any holes in the sequence (SQL Server does not "rollback" identity value if transaction has been rolled back). Usually it's not a big deal though.

    Speaking of the clustered index - which could be different than PK though - it's a different story.

    Identity (or any other auto increment values) as the clustered index have pros and cons. Pros - it's simple and usually introduce minimal fragmentation of the index on insert stage. It also efficient in terms of referential integrity in compare with composite keys (referencing column size and index row in the referencing table are smaller).

    Speaking of cons - there are 2 main issues. First - autoincrement CI introduces hot spots when your system is under heavy load. Think about multiple sessions trying to insert data to the same page - there would be latches to serialize page modification process in the memory as well as for the page/extent allocations. It's usually become an issue when system has hundreds of inserts per second (depends on the hardware, row size and a few more factors).

    Another problem that autoincrement clustered index rarely covers any queries in the system. You have to build NCI for them. For the large transactional tables NCI/key lookup usually expensive and in a lot of cases introduces a lot of physical reads.

    So for the large transactional tables I'd suggest to define CI that helps most frequent queries in the system. Same time do not make it wide - don't forget that columns from the CI are included to every NCI on the table even if you don't define them there.


    Thank you!

    My blog: http://aboutsqlserver.com

    • Proposed as answer by Naomi NEditor Wednesday, September 14, 2011 3:49 AM
    Wednesday, September 14, 2011 12:45 AM
  • >>I am trying to find the advantages of having an Identity on a column versus incrementing a sequential number as the primary key manually.

    I hate to post a link on the subject but you may want to read the following

    http://www.sqlteam.com/article/identity-and-primary-keys

    Wednesday, September 14, 2011 12:58 AM
  • There is a valid reason why I am thinking to do the incrementing manually. So, i just want to know what am I going to miss out not using

    the Identity.

     

    Can you share you reason?  Not to say there isn't a valid reason to do so but why not let the DBMS handle it?

     


    Dan Guzman, SQL Server MVP, http://weblogs.sqlteam.com/dang/
    Wednesday, September 14, 2011 2:19 AM
    Answerer
  • Here are a few videos by Kimberly Tripp that you might want to also review:

    Clustered Index:

       http://technet.microsoft.com/en-us/sqlserver/gg508879.aspx
          "The Clustered Index Debate", AV Presentation by Kimberly Tripp (SQL Server Masters series)

       http://www.sqlskills.com/BLOGS/KIMBERLY/post/GUIDs-as-PRIMARY-KEYs-andor-the-clustering-key.aspx
           Kimberly Tripp

    Index Internals:

       http://technet.microsoft.com/en-us/sqlserver/video/gg508878
          Kimberly Tripp

       http://technet.microsoft.com/en-us/sqlserver/video/gg508877
          (Demo)
          Kimberly Tripp

       http://technet.microsoft.com/en-us/sqlserver/gg545006.aspx
          "Indexing Strategies" AV Presentation by Kimberly Tripp (SQL Server Masters series)

     

    Wednesday, September 14, 2011 12:35 PM
  • Ideally, an incrementing number is not used as a PK as its adds nonsensical data where there should be inherent identifiers. Nonetheless, there are times that an inherently unique identifier is not available, hence we need to make one up.

    Keeping the number unique is not always easy, as MAX(Id) + 1 is slow because it needs to scan the TABLE (or read an INDEX) to get the current high value. It also does not allow caching of numbers, because who knows who else is also trying to cache some numbers.

    Oracle implemented the SEQUENCE which is like a variable that stores the high number. This answers both issues, in that it is just a variable (actually, a record in a system TABLE) avoiding the TABLE scan or INDEX read, and to cache, the number is simply incremented by the cache amount. To INSERT, one needs to pop a number off the SEQUENCE either manually, or via an INSERT TRIGGER.

    Autonumber is is the equivalent of an Oracle SEQUENCE/TRIGGER. It is a kludge, but very convenient.

    So, when deciding what to use, use the following steps:

    1. An inherently unique identifier for this entity.
    2. A manually added meaningful identifier.
    3. A (meaningless) numerical identifier.
      1. If the number needs to be incremental and gaps, add it manually.
      2. If not, use Autonumber.

     


    Wednesday, September 14, 2011 12:43 PM
    Answerer
  • Ideally, an incrementing number is not used as a PK, as its adds nonsensical data where there should be inherent identifiers. Nonetheless, there are times that an inherently unique identifier is not available, hence we need to make one up.

    Keeping the number unique is not always easy, as MAX(Id) + 1 is slow because it needs to scan the TABLE (or read an INDEX) to get the current high value. It also does not allow caching of numbers, because who knows who else is also trying to cache some numbers.

    Oracle implemented the SEQUENCEs which is like a variable that stores the high number. This answers both issues, in that it is just a variable (actually, a record in a system TABLE) avoiding the TABLE scan or INDEX read, and to cache the number is simply incremented by the cache amount. To INSERT, one needs to pop a number off the SEQUENCE either manually, or via an INSERT TRIGGER.

    Autonumber is is the equivalent of an Oracle SEQUENCE/TRIGGER. It is a kludge, but very convenient.

    So, when deciding what to use, use the following steps:

    1. An inherently unique identifier for this entity.
    2. A manually added meaningful identifier.
    3. A (meaningless) numerical identifier.
      1. If the number needs to be incremental and gaps, add it manually.
      2. If not, use Autonumber.

     


    I was just going to add comments about Oracle and DB2 sequence objects.  Keep in mind that it looks like sequence objects will be availble in the next release of SQL Server -- presently "Denali".
    Wednesday, September 14, 2011 12:46 PM
  • I was just going to add comments about Oracle and DB2 sequence objects.  Keep in mind that it looks like sequence objects will be availble in the next release of SQL Server -- presently "Denali".
    Thanx for the info, that's good to know. Sheesh, makes me want to play with it already, that and the other cool stuff they seem to be adding.
    Wednesday, September 14, 2011 1:20 PM
    Answerer
  • The Reason: I have to Integrate quiet an amount of data and insert into a database table using SSIS. I have 2 columns(these 2 cols are our point of interest) in the table out of which one is supposed to be an identity col. I want to copy the generated identity values in 1st Col into the second Col before integrating/inserting the data into this destination table(which is not possible with Identity ON on the 1st Col). Instead of inserting AND updating the 2nd Col  i am trying to prepare the data prior to the insert.

    Thanks.

    Wednesday, September 14, 2011 3:00 PM
  • Out of curiosity, why do you need to have 2 columns in the table populated with identical values? 

    If you want to change the value in the second column later, it could make more sense to keep it as null until it's actually changed.


    Thank you!

    My blog: http://aboutsqlserver.com

    Wednesday, September 14, 2011 6:49 PM
  • taking the simple example of a user table below

    Id    User_Id  FirstName  LastName

    1        1            Jill            Smith

    when the user modifies the last name twice ,the data supposed to be inserted is

    2        1           Jill             A

    3        1           Jill             B

    Now inserting a new user

    4        4           Ron           W

    where the user_id 1 or 4 remains constant as unique Id for that user. all the new users get the key of the ID as User_id.Hope this helps...


    • Edited by bala_BT Wednesday, September 14, 2011 8:40 PM
    Wednesday, September 14, 2011 8:40 PM
  • It's still not clear for me why do you want to keep ID = User_ID for the original/first version of the user.

    How do you reference the table - by unique ID or by pair of ID/User_ID?

    In the first case (referencing by ID), I'd something like that:

    create table dbo.Users
    (
    	UserID int not null,
    	ID int not null default identity(1,1)
    	...
    )


    and generate UserId with something like that: http://aboutsqlserver.com/2010/10/03/sunday-t-sql-tip-how-to-generate-pseudo-identity-values/

     

    You also need to think about CI/NCI in such case - it should be either CI on (UserId, ID) and unique NCI on (ID) or vise versa. Depends on your queries.

    If you reference table by the pair of values(User_Id, ID), I'd probably replace ID with the version and don't bother with identity at all. Something like that:

    create table dbo.Users
    (
    	UserID int not null,
    	VersionID int not null 
    		constraint DEF_Users_VersionId
    		default 0,
    
    	constraint PK_Users
    	primary key clustered(UserId, VersionId)
    )
    

    I'd generate UserID the same way as above and increment versionId for the new versions of the row starting with 0.

    Last, but not least, if you don't need to reference anything but last version of the user, I'd keep only the one version in this table and store old versions (history of the changes) in the different place


    Thank you!

    My blog: http://aboutsqlserver.com


    Thursday, September 15, 2011 12:27 AM
  • Thanks Dmitri and all. I got good material for a different thought process. Appreciate It!
    Thursday, September 15, 2011 4:00 PM
  • Dmitri,

     

    A quick question. Just thinking that PK on ( ID,User_ID) is better than PK on (User_Id,Version_Id) because,

    in the former case CI will be based on ID which is a sequential number, hence less fragmentation because it keeps on inserting at the end of the index. where as in later case having CI on User_Id might make the Index fragment more. Suggestions please!

    Just to mention most of the queries access only user_id on which i was planning to have a NCI.

    Thursday, September 15, 2011 4:13 PM
  • Bala,

    It really depends. Price of key lookup with NCI is high especially if you have very large table. It could be acceptable if your queries deal with small # of rows (small # of key lookups from table) although in case of large row sets you'd have performance degradations. Of course, I'm talking about the case when covered indexes are not in the game.

    Unless you have very high amount of inserts and very small amount of selects I would try to go with CI as (User_ID, ID) with perhaps some maintenance plan and fillfactor slightly less than 100%. It should work better because it illuminates key lookups even if it introduces page splits/fragmentation during insert stage. But, obviously, it's good idea to test both cases.


    Thank you!

    My blog: http://aboutsqlserver.com

    Thursday, September 15, 2011 6:30 PM