none
Identity column primary key Vs Custom Key

    Question

  • We are developing a batabase which is meant for financial domain,so it will import data from different source system..
    and data from our data base will be further passed to other applications.

    In contex of our system integration with other data sources ,whether  is it a good idea to have a auto integer primary key a or  to implement  some  logic  to generate primary key?

    Can some one guide us to some pratical data base design case studies?or some best practices.?


    Thursday, May 22, 2008 3:26 PM

Answers

  • I see the delimma more clearly as to what you are trying to accomplish.  How are you planning to merge the remote databases to the central database, merge replication?  In a merge replication you can assign identity ranges that will alliviate any problems with mixed keys.

     

    Take a look at this article and read up on merge replication

    http://msdn.microsoft.com/en-us/library/ms152543.aspx

    Friday, May 23, 2008 5:41 AM
    Moderator
  •  Gagan Kr Sharma wrote:

    In contex of our system integration with other data sources ,whether  is it a good idea to have a auto integer primary key a or  to implement  some  logic  to generate primary key?


    Auto integer primary key (surrogate key) is the accepted practice with SQL server databases.  It has the advantage of supporting high performance JOINs when used as PK and FK. For example, joining on emal (varchar(70) ) is much slower than joining on EmailID (int).

    You can have the following:

    IDENTITY(smallint, 1,1) - small tables

    IDENTITY(int, 1,1) - large tables

    IDENTITY(bigint, 1,1) - tables over 2 billion rows

    If you expect specific insertions you can use IDENTITY(int, 1, 5) which leaves 4 slots empty.

    Independently, you should have a Natural Key in each table which can be enforced by unique index/contraints or some other way like trigger or application.

    In AdventureWorks Production.Product has

    Code Snippet

    ProductID int identity(1,1) primary key

    while Name (product name) is a natural key with unique index.

    ProductNumber with unique index is an alternate key created by business users.

    Sometimes hard to define a unique natural key (see Sales.SalesOrderDetail business transaction table), in such a case the application enforces uniqueness.



    Kalman Toth Database & OLAP Architect
    Paperback / Kindle: Pass SQL Exam 70-461 & Job Interview: Programming SQL Server 2012


    Friday, May 23, 2008 7:44 AM
    Moderator

All replies

  • A primary key is a surrogate key that is used to uniquely identify a row.  It makes no difference whether the key is an integer or some custom key. Maintaining and creatling logic to create a unique key can be troublesome and complex. An identity column is a simplisitic method to identifiy rows and requires no "additive" code logic.  A basic rule of thumb is if a person cannot identify a row, without the use of a surrogate key, you need to change your design. 

     

    This article is really good and may help you out.

    http://www.simple-talk.com/sql/database-administration/ten-common-database-design-mistakes/

     

    Thursday, May 22, 2008 3:43 PM
    Moderator
  • Hi,Adam
                 Thx for your reply, but consider the  scenario  where  our application has been deployed to different branches in distributed mode of same Bank in a country.(I.e Database instanse is limited to each branch) .
    Now we need to merge data from all these installations to a central database,In this  case what I feel as a potential
    problem to have same primary keys associated with different customers of different branch (if I take auto Integer as primary
    key.
    While If we implement some custom logic to generate primary keys ,then we can append some branch identifier in each primary key so that customer from diffrent branches can be identified uniquely when data merging happens.

    What opinion do you have regarding above written logic.


    Friday, May 23, 2008 5:14 AM
  • composite primary key may help.

     

    Add another column to identify different branches.

    And make it together with existing primary key to get new composite primary key.

     

     

     

    Friday, May 23, 2008 5:31 AM
  • I see the delimma more clearly as to what you are trying to accomplish.  How are you planning to merge the remote databases to the central database, merge replication?  In a merge replication you can assign identity ranges that will alliviate any problems with mixed keys.

     

    Take a look at this article and read up on merge replication

    http://msdn.microsoft.com/en-us/library/ms152543.aspx

    Friday, May 23, 2008 5:41 AM
    Moderator
  • Yes ,that seems to be a solution...
    Friday, May 23, 2008 6:31 AM
  • Although I may not be 100 % correct but that seems to be a partial solution in our case...I think it is better to have primary key having some part of it devoted as branch identifier rather than  mantaining a column for branch since this composite primary key will serve as foreign key in child tables  and will make update,Insert relatively  tedious.

    What do u think?


    Friday, May 23, 2008 6:39 AM
  • Why would it make a difference in update/insert/delete?  If a record is written at a remote branch and given the id 5000, this id is then replicated back to the central database.  If any subsequent records are created for id 5000 in foreign tables, those too are written back to the central database.  The update/insert/delete would occur normally, aside from the fact that the identity seed will be derived from a range you specifiy.

     

    We need to step back and think about what you plan on doing with this key.  I do not know your database design but it seems that your database is not in normal form.  If you have to rely on a surrogate key to identify a branch, something is not quite right with the design. 

    Friday, May 23, 2008 7:00 AM
    Moderator
  • Hi Adams , I had forgotten to mention the name in my earlier reply but opinion expressed was in context of Zuomin's thoughts.
    Your solution is reasonably good.

    Friday, May 23, 2008 7:08 AM
  • Okay, sometimes it is difficult to determine who a post is addressed too.

     

    Friday, May 23, 2008 7:12 AM
    Moderator
  •  Gagan Kr Sharma wrote:

    In contex of our system integration with other data sources ,whether  is it a good idea to have a auto integer primary key a or  to implement  some  logic  to generate primary key?


    Auto integer primary key (surrogate key) is the accepted practice with SQL server databases.  It has the advantage of supporting high performance JOINs when used as PK and FK. For example, joining on emal (varchar(70) ) is much slower than joining on EmailID (int).

    You can have the following:

    IDENTITY(smallint, 1,1) - small tables

    IDENTITY(int, 1,1) - large tables

    IDENTITY(bigint, 1,1) - tables over 2 billion rows

    If you expect specific insertions you can use IDENTITY(int, 1, 5) which leaves 4 slots empty.

    Independently, you should have a Natural Key in each table which can be enforced by unique index/contraints or some other way like trigger or application.

    In AdventureWorks Production.Product has

    Code Snippet

    ProductID int identity(1,1) primary key

    while Name (product name) is a natural key with unique index.

    ProductNumber with unique index is an alternate key created by business users.

    Sometimes hard to define a unique natural key (see Sales.SalesOrderDetail business transaction table), in such a case the application enforces uniqueness.



    Kalman Toth Database & OLAP Architect
    Paperback / Kindle: Pass SQL Exam 70-461 & Job Interview: Programming SQL Server 2012


    Friday, May 23, 2008 7:44 AM
    Moderator
  • Just use a uniqueidentifier as the PK and be done with it.

     

    Personally, I always use an IDENTITY column with a 10 increment. If I field more servers, I have 9 other seed values I can use. Of course, when I need to add that 11th server instance, I'm screwed. I figure, however, if I need 11 servers, I'll know long beforehand and plan accordingly.

     

    Friday, May 23, 2008 10:59 PM
  •  SQLUSA wrote:

    You can have the following:

     

    IDENTITY(smallint, 1,1) - small tables

    IDENTITY(int, 1,1) - large tables

    IDENTITY(bigint, 1,1) - tables over 2 billion rows

     

    Actually, the bigint identity column will support over 9 quintillion rows (9,223,372,036,854,775,807 to be exact). Twice that if your seed is -9,223,372,036,854,775,808. Assuming an increment of 1, of course.

     

    It is the int that will support over 2 billion (2,147,483,647). In SQL server, an int is 32 bits.

     

    Friday, May 23, 2008 11:04 PM
  • I can't believe I'm about to recommend this but:

     

    Use a GUID.

     

    Now, before you all scream at me for suggesting such a thing (and I know there are some disadvantages), you can create a GUID in such a way that it will simulteously identify 1) Originating system and 2)Time of origination and you can do it in such a way that the guid will be sequential and thus suitable for a  PK (though I know they're large).

     

     

    select convert(uniqueidentifier,convert(binary(8), newid()) + convert(binary(2), 10) + convert(binary(6), getutcdate()))

     

    The "10" in the above here indicates the ID for the branch.  Each brach should be given its own id, 0-255.

    Code Snippet

    CREATE TABLE dbo.Table_1

    (

    ID uniqueidentifier NOT NULL,

    SomeName nvarchar(255) NOT NULL,

    SomeOtherValue int NOT NULL,

    ADateOfSomeSort datetime NOT NULL

    ) ON [PRIMARY]

    GO

    ALTER TABLE dbo.Table_1 ADD CONSTRAINT

    DF_Table_1_ID DEFAULT convert(uniqueidentifier,convert(binary(8), newid()) + convert(binary(2), 10) + convert(binary(6), getutcdate())) FOR ID

    GO

    ALTER TABLE dbo.Table_1 ADD CONSTRAINT

    PK_Table_1 PRIMARY KEY CLUSTERED

    (

    ID

    ) WITH( STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

    GO

     

     

    This techinque will create a GUID something like:

    186FAEC8-5EB0-43E1-000A-9AA501871057

     

    the right 12 characters here can be parsed like this:

    0x9AA5 This is the HEX value for "Days since 01 Jan 1900" in this case, 39589

    0x01871057 this is the HEX value for "3/1000 of a second since Midnight this morning"

    select convert(datetime, 0x9AA501871057) gives you the exact time this record was created

    The "000A" above indicates the Branch where the record was created. 

    The remaingin characters are all randomly generated to assure uniqueness.

     

    Because GUIDs sort based first on the right(12) characters, these guids will be created in order and becuase eerything is based on GMT time, all guids (regardless of timezone) will be created in a "syncronized" fashion.

     

    So, chew on THAT one.

     

    Cheers

     

     

     

    Friday, May 23, 2008 11:53 PM
  • Hi rusag2,
     Do you have some idea about performance comparison of Interger primary key & GUID as primary key in terms of Insert/update/delete?
    Tuesday, May 27, 2008 8:37 AM
  •  

    This will slow down the inserts, no doubt.  Generating GUIDs requires overhead. 

     

    As far as a PK, the GUID is bigger.  It's 16 bytes.  That's pretty big compared to INT or even BIGINT.

     

    However, for a real apples-to-apples comparison, if you use the method I have outlined, you will not have to insert a "CreateDate" for any records, nor will you need a "System Of Origin" field.  Both of these pieces of information will be encoded in the GUID.

     

    So, while you're using 16 bytes for the GUID, 2 of those are for system of origin and 6 are for the DateTime of CreateDate.  That's 8 bytes saved (assuming that you would have stored this same information elsewhere) along with  overhead for two additional fields.

     

    If you want the best of both worlds, create a GUID table like this:

    tbGUID

    RowID INT identity(1,1)

    GUID uniqueidentifier

     

    Then, generate a GUID in this table but use the RowID value eslewhere within the DB.  YOu can export/import data using the GUID value and refer to the record by the INT value (within that one system).

     

    Hope this helps

    Cheers

    Tuesday, May 27, 2008 9:47 PM
  •  

    Encoded in those GUIDS are two things:

    1.  Exact Time the guid was created (in GMT) in this case,

    you created the first GUID at exactly 2008-05-28 11:04:34.373 GMT

    and the last GUID was created at exactly 2008-05-28 11:04:40.780 GMT

    2.  System of Origination (in this case 0x0A or system #10)

     

    So, there are two things we know from the GUID that we would not know from a simple INT.

     

    In addition, any of these records may be moved or copied to a new database on a new server and the ID value need not change.  The "object" represented by this GUID is unique acrross all systems and platforms, so the various applications and databases and servers can freely copy/move data back and forth with ZERO potential confusion.

     

     

    Wednesday, May 28, 2008 6:25 PM
  • Of course, INTs are preferred in most circumstances.

     

    But in this case, with lots of records going from one DB to another, it is critical to make sure you can uniquely identify each record.  While this can be done as suggested above in AdventureWorks, that solution will do in 20 bytes what my solution proposes in 16 bytes.

     

    Plus, my solution has the added advantage of being much easier to code because TableName.ID will always be unique therefore, you can insert and update accross multiple databases and projects and the record is always easily identifiable, not only as an unique record, but also the system of origination and the date/time of origination.

     

    There is a marginal cost to pay because a 16byte guid is larger than a 4 byte int.  However, since this field would be the PK, that hit is not all that significant because SQL will use the index to quickly find the value, rather than to read each and every value in the table.  So, yes there is a performance hit, but it is small, and in this particular circumstance, that may be a fair price to pay to guarantee uniqueness and to facilitate faster builds and implementaions.

     

    In any event, it's well worth investigating.  FYI, in our enterprise we're using a similar (though not identical) strategy and have TB databases with 2 billion+ records.  Yes, there is a performance it.  In our case about 15% slower.  But, we're provided with a streamlined interface to the data and can easily move data between projects and servers and this saves us a great deal of time/money in dev costs and we're able to field updates much more quickly than if we had to create more elaborate conversions to maintain a unique record.

     

     

    Wednesday, May 28, 2008 7:52 PM
  • There is no data duplication accross the enterprise!  Whew.  That would be very ugly indeed.  Each GUID is unique....globally unique so there is no chance of overlap.

     

    BTW, the 2BB number was records in a single table, with many many tables over the 1BB mark.

    Wednesday, May 28, 2008 9:46 PM