Need primary key / impact on performance?

  • Question

  • Hi,

    I'm not sure whether I 'need' a primary key on my table (described below) and am not sure whether the answer is always 'yes' anyway!  Please bear with me...

    My main data table does not currently have a primary key defined. If I have one it will be a composite of 3 integer fields - ProdCustID, Yr, Period.  ProdCustID is a foreign key against a table of ProdCustIDs where ProdCustID is defined as the Primary Key (and is unique in that table).

I'm using cascaded referential integrity on ProdCustID, so deleting something from the table of ProdCustIDs deletes every matching record from the data table.  Not sure if that's relevant or not.

    My client application uses ADO/OLEDB to do a mixture of reads, updates and inserts against the main data table.  It has to be fast on inserts as well as reads and updates.

I've got away without a primary key on the data table, but now realise that ADO recordsets can sometimes be unable to update records because ADO/OLEDB decides it doesn't know which record in the table you want to update with your ADO Update() call.  I know that defining a (unique) primary key on the table will resolve this problem, but I've since found a better way of implementing that aspect, so it's no longer a 'must-have'.  On the other hand, being able to confidently use ADO Update() on the data table would be good.

    So, my question is what effect creating the primary key on the specified fields will have on performance.  I'm concerned that the overhead of updating the resulting index, particularly on inserts, will have a negative impact on performance.
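
For concreteness, the constraint I have in mind would look something like this (using the names from the DDL I post later in the thread; the constraint name and column order are just illustrative):

    ALTER TABLE [dbo].[proph_1_PCYT]
        ADD CONSTRAINT [PK_proph_1_PCYT]
        PRIMARY KEY CLUSTERED ([ProdCustID], [Yr], [Period]);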

    I hope I've explained the question clearly and will welcome any comments, advice or suggestions!  Many thanks.

    Pete


    Creator of the Software Update Wizard... add update over the web to your app with a single line of code!

    Friday, July 5, 2013 8:46 AM


All replies

Please refer to the link below; it explains this clearly.

    http://stackoverflow.com/questions/5509330/sql-server-insert-performance-with-and-without-primary-key


    Regards, RSingh

I wouldn't say 'clearly' necessarily throughout!  However, my interpretation of it is that my primary key option will slow down inserts.  As SELECT queries are already pleasingly fast, my conclusion is therefore that adding the primary key I described in my OP would be a negative move.  Agreed?

    Thanks for taking the trouble to respond - appreciated!

    Pete


    Creator of the Software Update Wizard... add update over the web to your app with a single line of code!

    Friday, July 5, 2013 9:41 AM
Adding a primary key constraint will create a clustered index on the table (unless one already exists, or the key is explicitly declared NONCLUSTERED).  It will organize the rows of the table logically.  With this, updates/deletes/inserts that locate rows via the key can be faster, because the data is ordered by the clustered index key.
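
For example, both forms are valid; only the first takes the clustered index (table, constraint and column names are illustrative, and you would run one or the other):

    -- Default: the primary key becomes the clustered index
    ALTER TABLE [dbo].[SomeTable] ADD CONSTRAINT [PK_SomeTable] PRIMARY KEY CLUSTERED ([ID]);

    -- Alternative: non-clustered primary key, leaving the clustered index available for other columns
    ALTER TABLE [dbo].[SomeTable] ADD CONSTRAINT [PK_SomeTable] PRIMARY KEY NONCLUSTERED ([ID]);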

    Hope this clarifies. 

As no information about the table/columns/keys was provided, I cannot say much.

    Friday, July 5, 2013 11:23 AM
Thanks!  This is the current definition.  I'm thinking about the primary key on proph_1_PCYT on fields ProdCustID, Period and Yr.

CREATE TABLE [dbo].[proph_PClookup](
    [ProdCustID] [int] NOT NULL,
    [ProdCode] [varchar](255) NOT NULL,
    [CustCode] [varchar](255) NOT NULL,
    PRIMARY KEY CLUSTERED
    (
        [ProdCustID] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

    GO

    SET ANSI_PADDING OFF
    GO

    ALTER TABLE [dbo].[proph_PClookup]  WITH CHECK ADD  CONSTRAINT [proph_fkeyCUSTHIER] FOREIGN KEY([CustCode])
    REFERENCES [dbo].[proph_Cust] ([custcode])
    ON UPDATE CASCADE
    ON DELETE CASCADE
    GO

    ALTER TABLE [dbo].[proph_PClookup] CHECK CONSTRAINT [proph_fkeyCUSTHIER]
    GO

    ALTER TABLE [dbo].[proph_PClookup]  WITH CHECK ADD  CONSTRAINT [proph_fkeyPRODHIER] FOREIGN KEY([ProdCode])
    REFERENCES [dbo].[proph_Prod] ([prodcode])
    ON UPDATE CASCADE
    ON DELETE CASCADE
    GO

    ALTER TABLE [dbo].[proph_PClookup] CHECK CONSTRAINT [proph_fkeyPRODHIER]
    GO

CREATE TABLE [dbo].[proph_1_PCYT](
    [ProdCustID] [int] NOT NULL,
    [Period] [smallint] NOT NULL,
    [Yr] [smallint] NOT NULL,
    [Qty] [float] NULL
    -- etc. (more floats and a varchar(4000))
) ON [PRIMARY]

    GO

    SET ANSI_PADDING OFF
    GO

    ALTER TABLE [dbo].[proph_1_PCYT]  WITH CHECK ADD  CONSTRAINT [proph_1_PCYT_fkeyPC] FOREIGN KEY([ProdCustID])
    REFERENCES [dbo].[proph_PClookup] ([ProdCustID])
    ON UPDATE CASCADE
    ON DELETE CASCADE
    GO

    ALTER TABLE [dbo].[proph_1_PCYT] CHECK CONSTRAINT [proph_1_PCYT_fkeyPC]
    GO

I thought the issue with inserts into a clustered index on ProdCustID, Period and Yr would be that they require a lot of physical reorganisation.

    Also, 99% of queries select all records for the ProdCustID (i.e. require all years/periods in the result).
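
A typical read looks something like this (simplified; the parameter is bound through ADO):

    SELECT [Period], [Yr], [Qty] -- plus the other value columns
    FROM [dbo].[proph_1_PCYT]
    WHERE [ProdCustID] = ?;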

    Does that change your advice?  Thanks.

    (Apologies - maybe my original post should have contained the above definitions.)

    Pete


    Creator of the Software Update Wizard... add update over the web to your app with a single line of code!

    Friday, July 5, 2013 12:26 PM
Maybe not; the clustered index would still be a better idea.

The physical reorganization would happen during the creation of the clustered index (a one-time activity).

If the values of the clustered index key columns change frequently, then you may need to rethink the key columns, and might need frequent index maintenance.
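
You can keep an eye on fragmentation with something like this (substitute your own table name):

    SELECT i.[name] AS index_name, ps.avg_fragmentation_in_percent, ps.page_count
    FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.proph_1_PCYT'), NULL, NULL, 'LIMITED') AS ps
    JOIN sys.indexes AS i ON i.[object_id] = ps.[object_id] AND i.index_id = ps.index_id;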

    Friday, July 5, 2013 1:28 PM
I thought the issue with inserts into a clustered index on ProdCustID, Period and Yr would be that they require a lot of physical reorganisation.

    Also, 99% of queries select all records for the ProdCustID (i.e. require all years/periods in the result).

    I think you are referring to a page split when you mention physical reorganization.  A page split will be required if the existing page does not have enough space for the new row.  You can mitigate page splits by specifying a fill factor when the index is (re)built/reorganized and reorg periodically.  In this case, it might be most appropriate to reorg after all ProdCustID rows have been inserted for a particular period (assuming Period/Year is incremental).  The fill factor value should ideally be large enough to accommodate the percentage of inserts between reorgs (e.g. 90 if the number of rows increases by 10%).
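
For example, assuming the composite primary key ends up named PK_proph_1_PCYT (the name and the 90 are illustrative):

    -- Rebuild leaving 10% free space per page to absorb out-of-order inserts
    ALTER INDEX [PK_proph_1_PCYT] ON [dbo].[proph_1_PCYT] REBUILD WITH (FILLFACTOR = 90);

    -- Lighter-weight periodic maintenance between rebuilds (note: REORGANIZE does not reapply the fill factor)
    ALTER INDEX [PK_proph_1_PCYT] ON [dbo].[proph_1_PCYT] REORGANIZE;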

    Keep in mind that inserts happen only once but selects usually occur much more frequently.  So it is usually best to optimize indexes for selects rather than inserts and use techniques like fill factor for fine tuning.


    Dan Guzman, SQL Server MVP, http://www.dbdelta.com

    • Proposed as answer by Kalman Toth Friday, July 5, 2013 10:22 PM
    • Marked as answer by Fanny Liu Friday, July 12, 2013 3:00 PM
    Friday, July 5, 2013 2:35 PM
    Answerer
  • >> My main data table does not currently have a PRIMARY KEY defined. If I have one it will be a composite of 3 integer fields [sic] - prod_cust_id, Yr, Period. prod_cust_id is a foreign key against a table of Prod_Cust_IDs where prod_cust_id is defined as the PRIMARY KEY (and is unique in that table). <<

    WRONG! No primary key means that this is not a table; this is a deck of 1950's punch cards written in T-SQL. Columns are nothing like fields; more punch card lingo! 

The data element name “prod_cust_id” is weird. A Customer is an entity, so that set needs a table. Products are entities; they also need a table. But what the heck is a “Prod_Cust”? Next, an identifier is an attribute property; the table cannot be named “Prod_Cust_Ids” in a valid schema.

    Sure wish you had the manners to follow Netiquette and post the DDL. 

    >> I'm using cascaded referential integrity on prod_cust_id so deleting something from the table of Prod_Cust_IDs [sic] deletes every matching record [sic] from the data table. Not sure it that's relevant or not. <<

More bad design. This is as if you created a table “Automobile_Squids_wgt”; two separate entities and an attribute crammed into one absurd table.

    >> My client application uses ADO/OLEDB to do a mixture of reads, updates and inserts against the main data table. It has to be fast on inserts as well as reads and updates. <<

    We are the database guys; we do not care about input and presentation layers. 

    >> I've got away without a PRIMARY KEY on the data table but now realize that ADO recordsets can sometimes be unable to update records [sic] due to ADO/OLEDB deciding it doesn't know which record [sic] in the table you want to update with your ado UPDATE() call. <<

This probably means that the whole database is a scrambled mess without data integrity. How soon can you do a data audit? If you will follow Netiquette and post the DDL, maybe we can help you. I have tried to clean “punch card code” before and it eats a lot of time.

--CELKO-- Books in Celko Series for Morgan-Kaufmann Publishing: Analytics and OLAP in SQL / Data and Databases: Concepts in Practice / Data, Measurements and Standards in SQL / SQL for Smarties / SQL Programming Style / SQL Puzzles and Answers / Thinking in Sets / Trees and Hierarchies in SQL

    • Proposed as answer by Van Dinh Saturday, July 6, 2013 3:13 AM
    • Marked as answer by Fanny Liu Friday, July 12, 2013 3:01 PM
    Friday, July 5, 2013 4:23 PM
Here is another alternative, especially if this table is referenced by other tables:

CREATE TABLE [dbo].[proph_1_PCYT](
    [PCTID] [int] IDENTITY(1,1) PRIMARY KEY NONCLUSTERED,
    [ProdCustID] [int] NOT NULL,
    [Period] [smallint] NOT NULL,
    [Yr] [smallint] NOT NULL,
    [Qty] [float] NULL,
    -- etc. (more floats and a varchar(4000))
    UNIQUE CLUSTERED ([ProdCustID], [Period], [Yr])
)


    Kalman Toth Database & OLAP Architect sqlusa.com
    New Book / Kindle: Exam 70-461 Bootcamp: Querying Microsoft SQL Server 2012

    Friday, July 5, 2013 10:26 PM
Did you actually measure the impact of the different scenarios (performance in terms of time and logical IO, fragmentation increase for different fill factors, size increase with additional indexes)?
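
For example, a quick way to get elapsed time and logical reads for a candidate query (the value is illustrative; run it against a production-like copy):

    SET STATISTICS IO, TIME ON;

    SELECT [Period], [Yr], [Qty]
    FROM [dbo].[proph_1_PCYT]
    WHERE [ProdCustID] = 42;

    SET STATISTICS IO, TIME OFF;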

    Assumption is the mother of all errors. Measurement will give you the truth.

It is almost always better to add a surrogate PK, as Mr. Kalman suggested, than to use the natural PK of multiple columns that is already there. There are several reasons. PK columns cannot be nullable, and some day you may want one of the "natural" columns to become nullable. It is also easier and more space-efficient for other tables to reference a single integer than a set of wide columns. And if that set of columns changes one day, you have to make changes to all child tables; with a surrogate PK you don't have that problem. In short, a surrogate PK (e.g. int identity) is much more flexible.

You can (and should) still create a UNIQUE constraint on the natural key columns. Creating constraints (PK, UNIQUE, foreign key, NOT NULL) is part of the logical design, and as a general rule they SHOULD be created (there are exceptions, but they are very, very rare, almost non-existent). If constraints and indexing are set up properly, performance will not suffer. There are people who argue against constraints because they are "bad for performance", but they should stay away from databases and do something they are good at.

The primary key does not have to be clustered, as Mr. Kalman showed in his script. The primary key and the clustered index can have completely different sets of key columns.

There is a big performance "gotcha" with foreign keys: each foreign key column should be indexed (on a big table, at least). If it isn't, deleting a single parent row can take forever, because SQL Server must internally check whether any child rows reference it. With no suitable index on the child table, SQL Server will full-scan it, and for a large table that means reading every row just to delete one parent row. So be careful to create those indexes. If the child table is small, or a parent row will never be deleted, you don't have to create them. In your case, those indexes are needed and missing.
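
In your case that would be something along these lines (the index name is up to you):

    -- Supports the FK check when a parent row is deleted from proph_PClookup
    -- (unnecessary if ProdCustID already leads the clustered index key)
    CREATE NONCLUSTERED INDEX [IX_proph_1_PCYT_ProdCustID]
        ON [dbo].[proph_1_PCYT] ([ProdCustID]);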

Inserts will be faster if they arrive at the end of an ever-increasing clustered index, thus not causing page splits; that is another plus for a clustered surrogate identity PK. The clustered key should ideally also be unique, narrow, fixed-length and NOT NULL. Updates will be faster if they do not make a row wider (lowering the fill factor may help if they DO make rows wider). Deletes will be faster if all large child tables have indexes on the foreign key columns that reference the table you delete from. Selects can benefit from certain indexes, but too many indexes will slow down insert/update/delete operations.

There are many factors affecting performance, but without measuring it is all just twaddle: assumptions that may be good, but may also prove wrong once measured (on a real, production-like workload).
    • Proposed as answer by Kalman Toth Saturday, July 6, 2013 7:34 AM
    • Marked as answer by Fanny Liu Friday, July 12, 2013 3:01 PM
    Saturday, July 6, 2013 12:01 AM
Just wow. How can you even consider having a table in your schema with no primary key? You missed 2NF. I could make a killing re-designing databases that such SQL programmers designed. And I do. Thanks

Hope this is helpful. J Alverson, Microsoft BI Solutions. Please use Marked as Answer if my post solved your problem and use Vote As Helpful if a post was useful.

    Monday, July 8, 2013 1:46 AM
  • The basic answer is that yes of course you need a primary key for a dozen reasons.

    Does this table have other indexes?  Does it already have a clustered index?  Does it have one or more unique indexes?

    SQL Server just expects you to have a clustered index, and because it uses the clustered index to implement the non-clustered indexes, you want to keep the clustered index - which is usually the primary key - small.  And so you see all over SQL Server land people putting surrogate primary keys on their tables, just identity fields with no real meaning.  In pure relational theory this is crufty.  In SQL Server land, it's probably best practice.  Such is life.

    How much data are we talking about anyway, and what kind of performance do you have now?  Is there a chance that the primary key may boost performance dramatically?  There can be all kinds of implications regarding locking and blocking, what about those?

    In short, it's far, far better to just do it, in about 99% of cases.

    Josh


    • Edited by JRStern Monday, July 8, 2013 2:08 AM typos
    Monday, July 8, 2013 2:07 AM
Having a primary key on the table is important for many reasons already stated; however, the primary key is NOT required to be clustered. While there are some trade-offs (as stated by others), a non-clustered PK would give you the functionality needed to support the different middleware, protect data integrity, and minimize the performance impact on tables with a high frequency of updates and deletes.

Simple maintenance jobs can be configured to automatically reorganize the PK index (based on fragmentation levels) or clean up forwarded records (if any).
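
A minimal sketch of such a job step, assuming a PK index named PK_proph_1_PCYT and a 10% threshold (both illustrative):

    -- Reorganize the PK index only when fragmentation passes the threshold
    IF EXISTS (
        SELECT 1
        FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.proph_1_PCYT'), NULL, NULL, 'LIMITED')
        WHERE index_id > 0 AND avg_fragmentation_in_percent > 10.0
    )
        ALTER INDEX [PK_proph_1_PCYT] ON [dbo].[proph_1_PCYT] REORGANIZE;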

    Bill


    William F. Kinsley

    Tuesday, July 9, 2013 8:28 PM