locked
Primary Key Design Practices RRS feed

  • Question

  • Hello Everyone,

    So I have been educating myself on SQL design practices. I have spent a lot of time going over various sites detailing Primary Key design practices. But after reading a lot of documentation and thinking about it, I feel like there is somewhat of a disconnect in the general Primary Key design practices discussion and the conclusions I reach after reading. I am trying to determine if I simply have not seen a discussion that takes this into account of if there are aspects of SQL that I simply am not grasping. Sorry for the long post, but I wanted to be sure I communicated things adequately.

    I feel the disconnect arises because the discussion of Primary Keys is muddied by clustered indexes, specifically the practice of automatically creating a clustered index on the Primary Key. It seems to me like the decision of what a clustered index should use should be independent of Primary Key selection. Perhaps that is because of a gap in my knowledge, but here is an example that illustrates how I arrived at that conclusion.

    ASP.NET offers login facilities with SQL DBs used for database storage. The Table aspnet_Users stores all the Users ASP.NET Recognizes. The Primary Key of that table is UserId a UNIQUEIDENTIFIER. Now if I want to tie additional information to that user, my first inclination is to create a new table referencing the UserId as a foreign key. However, that strikes me as a fairly wasteful proposition. Every reference is going to require a UNIQUEIDENTIFIER which increases the amount of space needed and has performance implications. This seems to be a bad choice if I can configure my ASP.NET application to use an IDENTITY value instead to reference a user after login.

    So to that end I want to create a DB:

    AspUserLookup

    AspUserLookupID SMALLINT IDENTITY(-32768, 1)

    UserId UNIQUEIDENTIFIER UNIQUE REFERENCES aspnet_Users(UserId) ON DELETE CASCADE

    Which leaves the questions of what my Primary Key should be and what my clustered index should be on. Now the advice I read in most places suggests AspUserLookupID is the obvious choice for Primary Key and clustered index. But to me that feels wrong. I think undoubtedly the clustered index needs to be on AspUserLookupID since all the references to the table will use that and not the UserId. Performance should be much better than clustering on UserId. But it does not make sense to me to make AspUserLookupID the Primary Key. In this case, AspUserLookupID is a surrogate key that exists purely for performance reasons. If there was no performance difference between UNIQUEIDENTIFIER and SMALLINT IDENTITY, there would be no reason to have AspUserLookupID at all. aspnet_Users.UserId is a great Natural Key in that it uniquely identifies the user in the ASP.NET login framework. So because a Natural Key exists, I am inclined to make it the Primary Key since it provides more meaningful information than using the meaningless IDENTITY field as the Primary Key:

    CREATE TABLE AspUserLookup (
    AspUserLookupID SMALLINT IDENTITY (-32768, 1) UNIQUE CLUSTERED,
    UserID UNIQUEIDENTIFIER PRIMARY KEY REFERENCES dbo.aspnet_Users(UserId) ON DELETE CASCADE
    )

    At this point, my logic says the following guideline makes sense in the majority of cases:

    - Always consider the question of clustered indexing independently of Primary Key selection (even though they often end up being the same)

    - If a surrogate key exists solely for performance reasons because of a wide Natural Key (such as a UNIQUEIDENTIFIER, VARCHAR, or multi-column key), make the surrogate key the clustered index but make the Natural Key the Primary Key. This offers maximum performance while still easily constraining entries based on the uniqueness guaranteed by the Natural Key

    So given the novel above, I had a few questions:

    1.) Is there some gap in my knowledge that makes my guideline regarding Primary Key selection flawed?

    2.) Is there a better alternative to my guideline for selecting Primary Key and clustered indexes?

    3.) If you agree with my design guideline, what factors could potentially prove an exception to the rule?


    Friday, November 23, 2012 7:22 PM

Answers

  • >If a surrogate key exists solely for performance reasons because of a wide Natural Key (such as a UNIQUEIDENTIFIER, VARCHAR, or multi-column key), make the surrogate key the clustered index but make the Natural Key the Primary Key. This offers maximum performance while still easily constraining entries based on the uniqueness guaranteed by the Natural Key

    Never make the NATURAL KEY the PRIMARY KEY.  (You must have a absolute airtight case to violate this rule.)

    Make the NATURAL KEY UNIQUE KEY clustered or nonclustered.

    If you make the WIDE NATURAL KEY the PRIMARY KEY, the FOREIGN KEYs must be WIDE also, and you are spreading table content from the PK table to the FK tables, a subtle violation of 3NF rules. "Elvis Presley" does not make a good FOREIGN KEY, it makes a good rock singer.

    Design example:

    CREATE TABLE Celebrity(
    	CelebrityID INT IDENTITY(1,1) PRIMARY KEY nonclustered,
    	LastName varchar(32) NOT NULL,
    	FirstName varchar(32) NOT NULL,
    	UNIQUE clustered (LastName, FirstName),
    	Birthdate DATE NOT NULL,
    	Profession varchar(64) NOT NULL,
    	CreateDate date default (CURRENT_TIMESTAMP));
    GO

    To decide about where to put the clustered index, you have to decide first about business critical queries.

    CLUSTERED index should be used to support business critical queries.

    Frequently that means the NATURAL KEY (UNIQUE KEY) should have the clustered index, but not always.

    >- Always consider the question of clustered indexing independently of Primary Key selection (even though they often end up being the same)

    Yes, clustered index is a "luxury" item in optimization, therefore is should be rewarded to business critical queries.

    Yes, frequently it ends up the same because clustered is the default for PRIMARY KEY.  Easy to remember: "PRIMARY KEY nonclustered" is the alternate term. 

    >So because a Natural Key exists, I am inclined to make it the Primary Key since it provides more meaningful information than using the meaningless IDENTITY field as the Primary Key:

    This part I don't understand: uniqueidentifier a meaningful natural key? What is meaningful about it? More meaningful than IDENTITY?

    It does not matter. You want the PRIMARY KEY to be meaningless. That way you never have to change it.

    CREATE TABLE AspUserLookup (
     AspUserLookupID SMALLINT IDENTITY (-32768, 1) PRIMARY KEY nonclustered,
     UserID UNIQUEIDENTIFIER UNIQUE CLUSTERED REFERENCES dbo.aspnet_Users(UserId) ON DELETE CASCADE
     );

    There is an optimization problem with your FOREIGN KEY: it is 16 bytes as opposed to the usual INT (4 bytes) FK. It may be OK, but you must have a good explanation why 16 bytes necessary for something doable in 4 bytes? The extra 12 bytes in each row is an unnecessary performance overhead.


    Kalman Toth SQL 2008 GRAND SLAM
    New Book: Beginner Database Design & SQL Programming Using Microsoft SQL Server 2012












    • Edited by Kalman Toth Friday, November 23, 2012 8:39 PM
    Friday, November 23, 2012 7:51 PM
  • >1.) What is the rationale behind (almost) never making a Natural Key the Primary Key?

    Just look at the Celebrity table. Do you want "Frank" "Sinatra" as double column FOREIGN KEY in tables referencing the Celebrity table? Or in a Product table, "Mountain Bike 200 Blue, 46" as a FOREIGN KEY?

    When you choose a PRIMARY KEY, you must think about how can it be referenced? That is why the SURROGATE INT (4 bytes) IDENTITY so popular: it provides an easy, efficient connection between tables.

    Database is tables & connections. You have to make sure that both are efficient.

    Lucky thing: the brightest brains of RDBMS industry provided us with the UNIQUE KEY (clustered or nonclustered), which is perfect for NATURAL KEY.

    > 2.  It is my understanding that a Foreign Key does not have to reference a Primary Key

    True, FK can reference PK, UQ (unique key) or unique index. If you really want to create confusion at your IT department, just start FK-ing unique indexes.

    The point is that everyone is trained as FOREIGN KEY referencing PRIMARY KEY. Therefore it is counterproductive to design FK referencing UQ or unique index.

    It seems to me like clustering on a LastName, FirstName results in an expensive clustered index.

    But that was just an example to demonstrate that the PK can be nonclustered. My point is: clustered index should be used to support business critical queries.

    In the case of the Celebrity table, I did not provide business critical queries, so we cannot tell if that is a good choice or not.

    Nothing but nothing is "expensive" to support business critical queries. Performance rules! You don't want 500 users complaining because you put the clustered index to the wrong place following a rule in an article?

    >clustered index be one that is unique, narrow, ever-increasing, and predictable

    Generally those are good qualities which support good performance. (In fact, the entire table should be narrow with fixed size columns.) Yet, the ultimate decision is made by the requirement for top performance of business critical queries.

    Kimberly Tripp: "BUT I'll start by saying that the best way to tune the vast majority of range  queries is through non-clustered [covering] indexes. "

    It depends.... Covering index is usually wide at the leaf level, it may slow down other business critical queries. Certainly, use covering index if you are in a jam, usually speeds up the target query. But then start getting performance feedback about the other queries which may be effected. Nonetheless, I disagree with Kimberly on this point. Clustered index is the best way to support business critical range queries.

    What is a business critical query?

    Query executed 1000 times a day is business critical.

    Query executed 1 time a day by the CEO may be business critical as well. However, you can go to the CEO and tell him/her that the clustered index either goes to support the 1000 users or his single query. Let him decide.


    Kalman Toth SQL 2008 GRAND SLAM
    New Book: Beginner Database Design & SQL Programming Using Microsoft SQL Server 2012










    • Edited by Kalman Toth Saturday, November 24, 2012 12:51 AM
    • Proposed as answer by David Gutter Sunday, November 25, 2012 7:55 AM
    Friday, November 23, 2012 9:04 PM
  • Thank you for your reply. I had a few questions so I can better understand the why behind your advice:

    1.) What is the rationale behind (almost) never making a Natural Key the Primary Key?

    2.) It is my understanding that a Foreign Key does not have to reference a Primary Key. In the case like the one listed above, is there a reason tables referencing an ASP User could not reference AspUserLookupID? That solves the issue of the foreign key being wide as a result of choosing a wide Primary Key.

    3.) It was my understanding that the clustered index serves as the basis for all other indexes. That, along with potential storage implications in some SQL implementations, leads many to suggest that the clustered index be one that is unique, narrow, ever-increasing, and predictable. For instance, Kimberly Tripp appears to be a strong proponent of that approach. Can you help me understand where the flaw is in Kimberly's view (or in my understanding of what you are proposing if your views are aligned). It seems to me like clustering on a LastName, FirstName results in an expensive clustered index.

    Thank you again for your feedback.

    Edit: Your post updated with a question that wasn't there originally. I said UNIQUEIDENTFIER was meaningful because that is how ASP.NET uniquely identifies each user within its Login framework. I think that design choice was made to make users portable. So it has inherent meaning in the ASP.NET Login Framework that IDENTITY lacks.


    Let me talk about 1) and 2) together.  Yes, you can do foreign keys where the related columns in the parent table are either the primary key columns, or the columns in a unique index (such an index will exist if either you have defined a unique index or you have defined a unique constraint) on the parent table.  So, if you defined your natural key as the primary key, the surrogate key as a unique index or constraint, and then used the surrogate key as the foreign key in the child tables, that would work OK.  But that's normally what is in practice done by SQL database designers.  It's not wrong, it's just not commonly done.

    With regard to 3), Kimberly is not wrong.  But, choosing a clustered index is a difficult process.  This is because it is very important for your system performance and you can only have one clustered index per table.  And, in that blog post Kimberly was responding to someone who wrote "The most important characteristic for a Clustered Index key is to satisfy range queries."  She said that there were other considerations that were often more important to your performance.  But she wasn't saying (IMO) that being useful for range queries was not ever important, just that that is an often used criteria that is overrated by many people.  Unique, narrow, ever-increasing are important.  There are others, see for example

    http://www.sqlskills.com/blogs/kimberly/post/The-Clustered-Index-Debate-again!.aspx

    So, now our list is unique, narrow, ever-increasing, static, non of the columns in the key are nullable, none of the columns in the key are variable width.  To this we should add the index is useful for range queries which are expensive, commonly done, and not covered by a nonclustered index.  Since many of these goals will conflict for many tables, picking the correct keys for the clustered index can be difficult.  You want the one which offers the best overall performance for your system.  And like all performance questions, this is more of an art than a science.

    Tom

    • Marked as answer by Mercury529 Sunday, November 25, 2012 10:46 PM
    Friday, November 23, 2012 9:42 PM

All replies

  • Hi

    You can read about normalization .Hope it will help you.


    Ahsan Kabir Please remember to click Mark as Answer and Vote as Helpful on posts that help you. This can be beneficial to other community members reading the thread. http://www.aktechforum.blogspot.com/

    Friday, November 23, 2012 7:43 PM
  • >If a surrogate key exists solely for performance reasons because of a wide Natural Key (such as a UNIQUEIDENTIFIER, VARCHAR, or multi-column key), make the surrogate key the clustered index but make the Natural Key the Primary Key. This offers maximum performance while still easily constraining entries based on the uniqueness guaranteed by the Natural Key

    Never make the NATURAL KEY the PRIMARY KEY.  (You must have a absolute airtight case to violate this rule.)

    Make the NATURAL KEY UNIQUE KEY clustered or nonclustered.

    If you make the WIDE NATURAL KEY the PRIMARY KEY, the FOREIGN KEYs must be WIDE also, and you are spreading table content from the PK table to the FK tables, a subtle violation of 3NF rules. "Elvis Presley" does not make a good FOREIGN KEY, it makes a good rock singer.

    Design example:

    CREATE TABLE Celebrity(
    	CelebrityID INT IDENTITY(1,1) PRIMARY KEY nonclustered,
    	LastName varchar(32) NOT NULL,
    	FirstName varchar(32) NOT NULL,
    	UNIQUE clustered (LastName, FirstName),
    	Birthdate DATE NOT NULL,
    	Profession varchar(64) NOT NULL,
    	CreateDate date default (CURRENT_TIMESTAMP));
    GO

    To decide about where to put the clustered index, you have to decide first about business critical queries.

    CLUSTERED index should be used to support business critical queries.

    Frequently that means the NATURAL KEY (UNIQUE KEY) should have the clustered index, but not always.

    >- Always consider the question of clustered indexing independently of Primary Key selection (even though they often end up being the same)

    Yes, clustered index is a "luxury" item in optimization, therefore is should be rewarded to business critical queries.

    Yes, frequently it ends up the same because clustered is the default for PRIMARY KEY.  Easy to remember: "PRIMARY KEY nonclustered" is the alternate term. 

    >So because a Natural Key exists, I am inclined to make it the Primary Key since it provides more meaningful information than using the meaningless IDENTITY field as the Primary Key:

    This part I don't understand: uniqueidentifier a meaningful natural key? What is meaningful about it? More meaningful than IDENTITY?

    It does not matter. You want the PRIMARY KEY to be meaningless. That way you never have to change it.

    CREATE TABLE AspUserLookup (
     AspUserLookupID SMALLINT IDENTITY (-32768, 1) PRIMARY KEY nonclustered,
     UserID UNIQUEIDENTIFIER UNIQUE CLUSTERED REFERENCES dbo.aspnet_Users(UserId) ON DELETE CASCADE
     );

    There is an optimization problem with your FOREIGN KEY: it is 16 bytes as opposed to the usual INT (4 bytes) FK. It may be OK, but you must have a good explanation why 16 bytes necessary for something doable in 4 bytes? The extra 12 bytes in each row is an unnecessary performance overhead.


    Kalman Toth SQL 2008 GRAND SLAM
    New Book: Beginner Database Design & SQL Programming Using Microsoft SQL Server 2012












    • Edited by Kalman Toth Friday, November 23, 2012 8:39 PM
    Friday, November 23, 2012 7:51 PM
  • Thank you for your reply. I had a few questions so I can better understand the why behind your advice:

    1.) What is the rationale behind (almost) never making a Natural Key the Primary Key?

    2.) It is my understanding that a Foreign Key does not have to reference a Primary Key. In the case like the one listed above, is there a reason tables referencing an ASP User could not reference AspUserLookupID? That solves the issue of the foreign key being wide as a result of choosing a wide Primary Key.

    3.) It was my understanding that the clustered index serves as the basis for all other indexes. That, along with potential storage implications in some SQL implementations, leads many to suggest that the clustered index be one that is unique, narrow, ever-increasing, and predictable. For instance, Kimberly Tripp appears to be a strong proponent of that approach. Can you help me understand where the flaw is in Kimberly's view (or in my understanding of what you are proposing if your views are aligned). It seems to me like clustering on a LastName, FirstName results in an expensive clustered index.

    Thank you again for your feedback.

    Edit: Your post updated with a question that wasn't there originally. I said UNIQUEIDENTFIER was meaningful because that is how ASP.NET uniquely identifies each user within its Login framework. I think that design choice was made to make users portable. So it has inherent meaning in the ASP.NET Login Framework that IDENTITY lacks.


    • Edited by Mercury529 Friday, November 23, 2012 8:51 PM
    Friday, November 23, 2012 8:38 PM
  • At this point, my logic says the following guideline makes sense in the majority of cases:

    - Always consider the question of clustered indexing independently of Primary Key selection (even though they often end up being the same)

    - If a surrogate key exists solely for performance reasons because of a wide Natural Key (such as a UNIQUEIDENTIFIER, VARCHAR, or multi-column key), make the surrogate key the clustered index but make the Natural Key the Primary Key. This offers maximum performance while still easily constraining entries based on the uniqueness guaranteed by the Natural Key

    So given the novel above, I had a few questions:

    1.) Is there some gap in my knowledge that makes my guideline regarding Primary Key selection flawed?

    2.) Is there a better alternative to my guideline for selecting Primary Key and clustered indexes?

    3.) If you agree with my design guideline, what factors could potentially prove an exception to the rule?


    The following is IMO.

    Yes, the choice of primary key and clustered index key is (or, anyway, ought to be) completely separate.  Microsoft has confused this issue by making the primary key the clustered index by default if your table does not already have a clustered index when the primary key is created.  But that is only a default.  It can, and often should, be overridden.  You choose the clustered index by what will give your system the best overall performance.  The choice for a primary key is determined by the design requirements for your system.  Performance normally comes into the choice of which column(s) should be your primary key only when deciding whether or not to have a surrogate key.  So choose your primary key column(s).  Then in a separate process, choose your clustered index.  Then, if they are the same, create the primary key as a clustered index (I would recommend explicitly making it clustered, thus making it clear you did this on purpose).  If they are different, create the primary key constraint as nonclustered, and then create the clustered index on the columns you have chosen.

    (Almost?) always, if you have a surrogate key, you would want to make it the primary key. In fact, I don't recall ever creating a surrogate key and not making it the primary key, and I can't think of a reason why you would do that, so I was going to say you always want to do that, but decided to leave myself a small amount of wiggle room <grin>.  Then, put a unique index on the natural key.  Then any foreign keys in other tables which reference this table would have the surrogate key.  Common reasons for wanting a surrogate key include cases where the natural key is wide and this table has child tables where each row needs a reference to this table.  If you put the natural key in each row of the child table(s), then the child tables can get very large.  If you use only a narrow surrogate key (linke an integer identity column), you can save much space in the child table(s).  Another reason to use a surrogate key is if the natural key is subject to change (for example, your companies product id might be the natural key of your product table, but if your company periodically changes product id, you probably want a surrogate key for products).  If you use the natural key as the primary key and use it in the foreign key relationships in the child tables, then whenever you change the natural key in the parent table, you (or SQL if you use the CASCADE option), must change it in all the child tables.  This can make changing the natural key in one row of the parent table into an update of thousands or millions of child table rows.  A big advantage of surrogate keys is that since they exist only in your database and are not ever exposed to any user (at least if you are handling your surrogate keys correctly), you should never need to update them.

    Tom

    Friday, November 23, 2012 8:47 PM
  • >1.) What is the rationale behind (almost) never making a Natural Key the Primary Key?

    Just look at the Celebrity table. Do you want "Frank" "Sinatra" as double column FOREIGN KEY in tables referencing the Celebrity table? Or in a Product table, "Mountain Bike 200 Blue, 46" as a FOREIGN KEY?

    When you choose a PRIMARY KEY, you must think about how can it be referenced? That is why the SURROGATE INT (4 bytes) IDENTITY so popular: it provides an easy, efficient connection between tables.

    Database is tables & connections. You have to make sure that both are efficient.

    Lucky thing: the brightest brains of RDBMS industry provided us with the UNIQUE KEY (clustered or nonclustered), which is perfect for NATURAL KEY.

    > 2.  It is my understanding that a Foreign Key does not have to reference a Primary Key

    True, FK can reference PK, UQ (unique key) or unique index. If you really want to create confusion at your IT department, just start FK-ing unique indexes.

    The point is that everyone is trained as FOREIGN KEY referencing PRIMARY KEY. Therefore it is counterproductive to design FK referencing UQ or unique index.

    It seems to me like clustering on a LastName, FirstName results in an expensive clustered index.

    But that was just an example to demonstrate that the PK can be nonclustered. My point is: clustered index should be used to support business critical queries.

    In the case of the Celebrity table, I did not provide business critical queries, so we cannot tell if that is a good choice or not.

    Nothing but nothing is "expensive" to support business critical queries. Performance rules! You don't want 500 users complaining because you put the clustered index to the wrong place following a rule in an article?

    >clustered index be one that is unique, narrow, ever-increasing, and predictable

    Generally those are good qualities which support good performance. (In fact, the entire table should be narrow with fixed size columns.) Yet, the ultimate decision is made by the requirement for top performance of business critical queries.

    Kimberly Tripp: "BUT I'll start by saying that the best way to tune the vast majority of range  queries is through non-clustered [covering] indexes. "

    It depends.... Covering index is usually wide at the leaf level, it may slow down other business critical queries. Certainly, use covering index if you are in a jam, usually speeds up the target query. But then start getting performance feedback about the other queries which may be effected. Nonetheless, I disagree with Kimberly on this point. Clustered index is the best way to support business critical range queries.

    What is a business critical query?

    Query executed 1000 times a day is business critical.

    Query executed 1 time a day by the CEO may be business critical as well. However, you can go to the CEO and tell him/her that the clustered index either goes to support the 1000 users or his single query. Let him decide.


    Kalman Toth SQL 2008 GRAND SLAM
    New Book: Beginner Database Design & SQL Programming Using Microsoft SQL Server 2012










    • Edited by Kalman Toth Saturday, November 24, 2012 12:51 AM
    • Proposed as answer by David Gutter Sunday, November 25, 2012 7:55 AM
    Friday, November 23, 2012 9:04 PM
  • Thank you for your reply. I had a few questions so I can better understand the why behind your advice:

    1.) What is the rationale behind (almost) never making a Natural Key the Primary Key?

    2.) It is my understanding that a Foreign Key does not have to reference a Primary Key. In the case like the one listed above, is there a reason tables referencing an ASP User could not reference AspUserLookupID? That solves the issue of the foreign key being wide as a result of choosing a wide Primary Key.

    3.) It was my understanding that the clustered index serves as the basis for all other indexes. That, along with potential storage implications in some SQL implementations, leads many to suggest that the clustered index be one that is unique, narrow, ever-increasing, and predictable. For instance, Kimberly Tripp appears to be a strong proponent of that approach. Can you help me understand where the flaw is in Kimberly's view (or in my understanding of what you are proposing if your views are aligned). It seems to me like clustering on a LastName, FirstName results in an expensive clustered index.

    Thank you again for your feedback.

    Edit: Your post updated with a question that wasn't there originally. I said UNIQUEIDENTFIER was meaningful because that is how ASP.NET uniquely identifies each user within its Login framework. I think that design choice was made to make users portable. So it has inherent meaning in the ASP.NET Login Framework that IDENTITY lacks.


    Let me talk about 1) and 2) together.  Yes, you can do foreign keys where the related columns in the parent table are either the primary key columns, or the columns in a unique index (such an index will exist if either you have defined a unique index or you have defined a unique constraint) on the parent table.  So, if you defined your natural key as the primary key, the surrogate key as a unique index or constraint, and then used the surrogate key as the foreign key in the child tables, that would work OK.  But that's normally what is in practice done by SQL database designers.  It's not wrong, it's just not commonly done.

    With regard to 3), Kimberly is not wrong.  But, choosing a clustered index is a difficult process.  This is because it is very important for your system performance and you can only have one clustered index per table.  And, in that blog post Kimberly was responding to someone who wrote "The most important characteristic for a Clustered Index key is to satisfy range queries."  She said that there were other considerations that were often more important to your performance.  But she wasn't saying (IMO) that being useful for range queries was not ever important, just that that is an often used criteria that is overrated by many people.  Unique, narrow, ever-increasing are important.  There are others, see for example

    http://www.sqlskills.com/blogs/kimberly/post/The-Clustered-Index-Debate-again!.aspx

    So, now our list is unique, narrow, ever-increasing, static, non of the columns in the key are nullable, none of the columns in the key are variable width.  To this we should add the index is useful for range queries which are expensive, commonly done, and not covered by a nonclustered index.  Since many of these goals will conflict for many tables, picking the correct keys for the clustered index can be difficult.  You want the one which offers the best overall performance for your system.  And like all performance questions, this is more of an art than a science.

    Tom

    • Marked as answer by Mercury529 Sunday, November 25, 2012 10:46 PM
    Friday, November 23, 2012 9:42 PM
  • Thank you for your reply as well Tom.

    Your reply and Kalman's reply make me wonder if perhaps there is more of a philosophical difference and less of a difference in the nuts and the bolts of the matter (or just some incorrect assumption somewhere on my part). I say this because both of you suggested that my wide PK would prove problematic for performance/storage of FKs. My understanding was a FK can reference any Superkey. In my example, the intention was for all tables to reference the IDENTITY column AspUserLookupID not the PK.

    I am self-taught with regard to SQL (using the various resources available to me), so I don't have the experience in SQL production environments and the practices in those environments. It seems to me based on your reply and Kalman that FKs referencing PKs is a prominent practice. With understanding that is likely prevalent, it makes more sense to me that there wouldn't be discussion of making a surrogate key a clustered index and a PK the NK.

    It appears that if you are explicit in your DB about things like UNIQUE indexes, constraints, and such (and you should be), then PK serves more of a documentation role than it does a technical role (though admittedly there may be other things DBs do with Primary Keys from a technical level that I am unaware of).

    Coming from outside the culture, I do find it strange that a PK is the key that says the least about the data contained. I understand it from a technical level when combined with insight as to the practices from your earlier posts, but from a philosophical level it feels like more of an empty distinction. It seems to me like two concepts should exist for every table (if only conceptually):

    1.) What column constitutes the most efficient Foreign Key for general referencing purposes (the intentionally meaningless IDENTITY field)

    2.) A Natural Key (where one exists) that gives other developers insight into what business assumptions were made about the data

    If that existed as opposed to just the Primary Key concept, then developers wouldn't be forced to choose between making a meaningless value Primary or the data that is actually considered important for identification purposes. With a model like one above, you'd have better insight into design intentions. When no Foreign Key exists, the table is not designed for any lookups. When the Natural Key is the Foreign Key, the belief was the Natural Key was a permanent value, incapable of change. Where The Foreign Key differed from the Natural Key, the belief was the Natural Key was capable of changing at sometime in the future (the vast majority of cases since change may be unlikely but possible). And when no Natural Key existed, there was no way to uniquely identify data from a business perspective.

    But maybe that already exists to a certain degree with Primary Key (Foreign Key reference column) and UNIQUE (Natural Key). But for someone who isn't familiar with the SQL culture and practices, that nomenclature is non-intuitive (and frankly a little bit contradictory with Primary Key).

    Regardless, I appreciate you both providing feedback. It seems like I would be best served to follow the practices most widely used today and just start thinking of Primary Key as a meaningless Foreign Key reference value only (except where the Natural Key is narrow and there is an absolute guarantee of permanence). Thank you both again.


    • Edited by Mercury529 Friday, November 23, 2012 10:00 PM
    Friday, November 23, 2012 9:59 PM
  • >2.) A Natural Key (where one exists) that gives other developers insight into what business assumptions were made about the data

    A NATURAL KEY must exist, or you don't have a table.

    Inexperienced designers always start with "TablenameID INT IDENTITY(1,1) PRIMARY KEY" as the first row of the table, and then feel "safe" because they have a PRIMARY KEY.

    That would be OK as long as they construct the NATURAL KEY and make it a UNIQUE KEY (clustered or nonclustered).  It happens often that the table is left without a NATURAL KEY (UNIQUE KEY).

    I recommend the following table design process:

    1. Design the NATURAL KEY, indicate it with a UNIQUE KEY. (Just the UNIQUE word in syntax).

    2. Design the FK columns.

    3. Design the non-key columns.

    4. Design the row-maintenance columns.

    5. Add the SURROGATE IDENTITY PK at the top.

    Where to put the clustered index is a guess at database design time. That can only be decided when you identify the business critical queries. Clustered index can be moved quite easily if the table is not huge.


    Kalman Toth SQL 2008 GRAND SLAM
    New Book: Beginner Database Design & SQL Programming Using Microsoft SQL Server 2012



    • Edited by Kalman Toth Friday, November 23, 2012 10:21 PM
    Friday, November 23, 2012 10:12 PM
  • Thanks again Kalman.

    With regard to my statement about Natural Keys, I was trying to take into account special circumstances where a Natural Key might not really exist. For instance, a table that tracks non-distinct log entries doesn't necessarily have an innate Natural Key as defined by the data. If you are willing to accept duplication of data, then creating a surrogate key (like IDENTITY) every time an entry is added seems acceptable. But it would seem like in most cases, a database is founded on the principle of having data that is identifiable by a Natural Key.

    Thank you for the advice on the process of designing a database. Your approach seems very logical.


    • Edited by Mercury529 Saturday, November 24, 2012 1:34 AM
    Saturday, November 24, 2012 1:33 AM
  • I had two other questions that are related to the original. If you can share any input, that would be appreciated:

    1.) So if we take my example above of the ASP.NET Login Structure, we now should have the following given the advice above:

    CREATE TABLE AspUserLookup (
    AspUserLookupID SMALLINT IDENTITY (-32768, 1) PRIMARY KEY CLUSTERED,
    UserId UNIQUEIDENTIFIER NOT NULL UNIQUE NONCLUSTERED REFERENCES dbo.aspnet_Users(UserId) ON DELETE CASCADE
    )

    Now what if we extended the concept of a user to include some leaf table type of information like colors the user likes. So we could have something like

    CREATE TABLE AspUserColorNameLookup (
    AspUserColorNameLookupID SMALLINT IDENTITY (-32768, 1) PRIMARY KEY CLUSTERED,
    ColorName VARCHAR(64) NOT NULL UNIQUE NONCLUSTERED
    )

    and

    CREATE TABLE AspUserColors (
    AspUserLookupID SMALLINT NOT NULL REFERENCES AspUserLookup(AspUserLookupID) ON DELETE CASCADE,
    AspUserColorNameLookupID SMALLINT NOT NULL REFERENCES dbo.AspUserColorNameLookup(AspUserColorNameLookupID) ON DELETE CASCADE

    UNIQUE (AspUserLookupID, AspUserColorNameLookupID)
    )

    a.) Is there anything egregiously wrong with the general construction presented here?

    b.) Does an IDENTITY Column need to be added even if there is no intention of ever using the leaf data as a foreign key elsewhere?

    c.) How does one go about selecting the Primary Key for AspUserColors?

    d.) How does one go about selecting what the clustered key should be (the first one before you have a chance to gather analysis of the system at work)?

    2.) Should I have any natural inclination when selecting SMALLINT, INT, or BIGINT for an IDENTITY field? It would seem SMALLINT offers storage and likely performance benefits (the performance benefits I have not seen clearly stated, so it is an assumption). But is it a nightmare to convert to a larger IDENTITY column if some fundamental assumption about how many entries will be present turns out to be wrong? Are there any best practices around selecting the right size IDENTITY column?

    Thanks again.


    • Edited by Mercury529 Saturday, November 24, 2012 2:27 AM
    Saturday, November 24, 2012 2:25 AM
  • a.) Is there anything egregiously wrong with the general construction presented here?

    b.) Does an IDENTITY Column need to be added even if there is no intention of ever using the leaf data as a foreign key elsewhere?

    c.) How does one go about selecting the Primary Key for AspUserColors?

    d.) How does one go about selecting what the clustered key should be (the first one before you have a chance to gather analysis of the system at work)?

    2.) Should I have any natural inclination when selecting SMALLINT, INT, or BIGINT for an IDENTITY field? It would seem SMALLINT offers storage and likely performance benefits (the performance benefits I have not seen clearly stated, so it is an assumption). But is it a nightmare to convert to a larger IDENTITY column if some fundamental assumption about how many entries will be present turns out to be wrong? Are there any best practices around selecting the right size IDENTITY column?

    Thanks again.


    a) It looks fine to me, except instead of declaring a unique constraint, you probably want to make it a (clustered) primary key on AspUserLookupID, AspUserColorNameLookupID - see discussion of b) below.

    b) It is normal in a cross reference table like AspUserColors to just make the primary key the keys to the two tables you are referencing.  So, I would not normally add an identity column to this type of table.

    c) See b).  If you do that, your only real decision is do you want the primary key to be (AspUserLookupID, AspUserColorNameLookupID) or (AspUserColorNameLookupID, AspUserLookupID).  I would tend to choose it either the insert pattern into this table (for example, if your inserts tend to insert a new user and a bunch of colors associated with that user, you may want AspUserLookupID to be the first column in the key because since the new user will have a AspUserLookupID larger than any row currently in the AspUserColors table all these new rows will be added at the end of the clustered index), or choose it based on the type of select queries you tend to do, that is do you do more queries like "give me all colors associated with user 37" or do you do more like "give me all users associated with color 14"?

    d)  You do the best job you can at estimating what will be a good choice using experience and reading and guesswork.  Don't worry much about small tables or tables that are used very often.  Pay a lot of attention to large tables that will have a great deal of selects, inserts, updates, or deletes.  Benchmarks are helpful if you have the time and resources.  If you do benchmarks, do them with a realistic amount of data.

    2)  I would normally default to INT.  Yes, you can use TINYINT, or SMALLINT.  But if you are going to use SMALLINT, you can have at most 65,536 rows.  Using SMALLINT that takes 65,536 * 2 bytes or 131,072 bytes (plus some overhead I'm going to ignore for purposes of this discussion).  Using INT that takes 65,536*5 or 262,144 bytes.  So using SMALLINT will save you at most 131,072 bytes (and it will actually be much less than that because if your table gets anywhere near 65,536 rows you almost certainly need to be using INTEGER to allow for future growth anyway).  100K or so is nothing in the amount of disk sapce, IO, memory, etc and not worth worrying about.  Please note that there are people out there who will disagree with me on this and believe if you know you will never have more than say 1,000 rows you should use SMALLINT.  Me, I've seen too many cases where tables which when designed were never intended to grow large, but they did.

    So why don't I then just say use BIGINT everywhere.  I don't do that because INT can hold many more rows (more than 4 billion).  So suppose I have a table with 400 million (400,000,000) rows - note that that is only 10% of the largest number.  For that table an INT works just fine and takes 400,000,000*4 or about 1.6GB (gigabytes).  If I use bigint, those 400,000,000 rows will need 8 bytes for the BIGINT or about 3.2GB.  Now the cost is 1.6GB and that's much too much.  So, I wouldn't use BIGINT unless this table has a reasonable chance of growing beyond the 4 billion rows.

    You bet, changing the size of a primary key column can be an expensive process.  You especially want to avoid doing that on a table which has a large number or rows or if the table has child tables and the child tables have a large number of rows.  Life is much easier if you pick the correct size to begin with. 

    Tom

    Saturday, November 24, 2012 3:53 AM
  • The junction table AspUserColors should have a PRIMARY KEY:

    CREATE TABLE AspUserColors (
     AspUserLookupID SMALLINT NOT NULL 	
    		REFERENCES AspUserLookup(AspUserLookupID) ON DELETE CASCADE,
     AspUserColorNameLookupID SMALLINT NOT NULL 	
    		REFERENCES dbo.AspUserColorNameLookup(AspUserColorNameLookupID) ON DELETE CASCADE
     PRIMARY KEY (AspUserLookupID, AspUserColorNameLookupID)
    );

     

    If it is FOREIGN KEY referenced, then we have to add IDENTITY SURROGATE PK in order to avoid double column FOREIGN KEY:

    CREATE TABLE AspUserColors (
     ID INT IDENTITY(1,1) PRIMARY KEY nonclustered,
     AspUserLookupID SMALLINT NOT NULL 	
    	REFERENCES AspUserLookup(AspUserLookupID) ON DELETE CASCADE,
     AspUserColorNameLookupID SMALLINT NOT NULL 	
    	REFERENCES dbo.AspUserColorNameLookup(AspUserColorNameLookupID) ON DELETE CASCADE
     UNIQUE clustered (AspUserLookupID, AspUserColorNameLookupID),
     Description nvarchar(256),
     CreateDate date default(CURRENT_TIMESTAMP)
     );




    Kalman Toth SQL 2008 GRAND SLAM
    New Book: Beginner Database Design & SQL Programming Using Microsoft SQL Server 2012






    • Edited by Kalman Toth Saturday, November 24, 2012 2:36 PM
    Saturday, November 24, 2012 12:48 PM
  • Thank you both for your answers. The logic you both presented makes sense to me and is fairly intuitive. I appreciate the feedback.

    Tom, in your analysis of SMALLINT vs BIGINT, you were looking at the storage costs based only on the lookup table correct? For instance, if we take the Color example, AspUserColorNameLookup would only be at most ~131kb larger for going with INT vs SMALLINT. But if there were 100,000,000s of users that reference those colors as FKs, the decision to go with the INT vs SMALLINT could have much larger storage ramifications, right?

    On modern hardware, is there any performance advantage comparing a SMALLINT instead of an INT or BIG INT? I'd imagine not, unless implementations of SQL somehow was designed to do multiple comparisons at one time when dealing with integers smaller than 32/64 bits. I wasn't able to find any real-world results on Google. It's probably too minor a point to concern myself with, but it's interesting to me just from deepening my understanding of any optimizations SQL might have.

    Saturday, November 24, 2012 5:47 PM
  • >On modern hardware, is there any performance advantage comparing a SMALLINT instead of an INT or BIG INT?

    Probably insignificant due to the immense CPU power available on a server (even on your laptop!).

    However, it pays peformance-wise to make the table narrower and if possible use only fixed-size columns. Narrow table means that more rows fit on an 8K page which is how the (slow) disk io is carried out.

    Use smallint (2 bytes) instead of int (4 bytes) if fits.

    Same for DATE (3 bytes) instead of DATETIME (8 bytes), SMALLMONEY (4 bytes) vs. MONEY (8 bytes), and so on.

    If there is only one varchar(15) column in the table, then make all the other columns as narrow as possible and change the varchar to char(15) fixed width. That makes the entire table row fixed width. That means, the database engine can calculate ahead/back as searching in the table. The net result can be performance gain for business critical queries.


    Kalman Toth SQL 2008 GRAND SLAM
    New Book: Beginner Database Design & SQL Programming Using Microsoft SQL Server 2012





    • Edited by Kalman Toth Saturday, November 24, 2012 7:21 PM
    • Proposed as answer by David Gutter Sunday, November 25, 2012 7:56 AM
    Saturday, November 24, 2012 7:05 PM
  • Yes, if your child tables have many more rows, that would make the difference in disk size, memory size, and I/O between SMALLINT and INT much greater.  So that would be a good case for using SMALLINT if SMALLINT will work for your parent table primary key. 

    Tom

    Sunday, November 25, 2012 4:38 AM
  • There is no such thing as a "universal, one-size-fits-all" key. Just as no two sets of entities are the same, the attributes that make them unique have to be found in the reality of the data. Here is my classification of types of keys (needs a monofont): 

                                           natural artificial exposed surrogate 
    =================================================
    Constructed from reality    |   
       of the data model           |  Y         N         N         Y
                                              |
          verifiable in reality       |  Y         N         N         N
                                              |
           verifiable in itself        |  Y         Y         N         N
                                             |
        visible to the user         |  Y         Y         Y         N

    1) A natural key is a subset of attributes which occur in a table and act as a unique identifier. They are seen by the user. You can go to the external reality and verify them. you would also like to have some validation rule. Example: UPC codes on consumer goods (read the package barcode) and validate them with a check digit or a manufacturer's website, geographical co-ordinates (get a GPS). 

    2) An artificial key is an extra attribute added to the table which is seen by the user. It does not exist in the external reality, but can be verified for syntax or check digits inside itself. It is up to the DBA to maintain a trusted source for them. Example: the open codes in the UPC scheme which a user can assign to his own stuff. The check digits still work, but you have to verify them inside your own enterprise.

    If you have to construct a key yourself, it takes time to design them, to invent a validation rule, set up audit trails, etc. 

    3) An "exposed physical locator" is not based on attributes in the data model and is exposed to user. There is no reasonable way to predict it or verify it, since it usually comes from the physical state of the hardware at the time of data insertion. The system obtains a value thru some physical process in the storage hardware totally unrelated to the logical data model. Example: IDENTITY columns, other proprietary, non-relational auto-numbering devices. 


    Technically, these are not really keys at all, since they are attributes of the PHYSICAL storage and are not even part of the LOGICAL data model. But they are handy for lazy, non-RDBMS programmers who don't want to research or think!  This is the worst way to program in SQL. The historical reason for this style is how you can fake a Poitier chain! This made some sense in the days of slow, small 16 bit hardware for performance. Today, it is just overhead and dangerous; the old network databases had code to maintain the links, but you have to write it yourself in SQL . 
    4) A surrogate key is system generated to replace the actual key behind the covers where the user never sees it. It is based on attributes in the table. Example: Teradata hashing algorithms, indexes, pointer chains, ADABASE numbers, etc. 

    The fact that you can never see it or use it for DELETE and UPDATE or create it for INSERT is vital. When users can get to them, they will screw up the data integrity by getting the real keys and these physical locators out of synch. The system must maintain them. 

    **  Notice that people get "exposed physical locator" and surrogate mixed up; they are totally different concepts. **

    An appeal to authority, with a quote from Dr. Codd: "..Database users may cause the system to generate or delete a surrogate, but they have no control over its value, nor is its value ever displayed to them ..."(Dr. Codd in ACM TODS, pp 409-410) and Codd, E. (1979), Extending the database relational model to capture more meaning. ACM Transactions on Database Systems, 4(4). pp. 397-434. 

    This means that a surrogate ought to act like an index; created by the user, managed by the system and NEVER seen by a user. That means never used in queries, DRI or anything else that a user does. 

    Codd also wrote the following:

    "There are three difficulties in employing user-controlled keys as permanent surrogates for entities.

    (1)  The actual values of user-controlled keys are determined by users and must therefore be subject to change by them (e.g. if two companies merge, the two employee databases might be combined with the result that some or all of the serial numbers might be changed.).

    (2)  Two relations may have user-controlled keys defined on distinct domains (e.g. one uses social security, while the other uses employee serial numbers) and yet the entities denoted are the same.

    (3)  It may be necessary to carry information about an entity either before it has been assigned a user-controlled key value or after it has ceased to have one (e.g. and applicant for a job and a retiree).

    These difficulties have the important consequence that an equi-join on common key values may not yield the same result as a join on common entities. A solution - proposed in part [4] and more fully in [14] - is to introduce entity domains which contain system-assigned surrogates. Database users may cause the system to generate or delete a surrogate, but they have no control over its value, nor is its value ever displayed to them....." (Codd in ACM TODS, pp 409-410).

    References

    Codd, E. (1979), Extending the database relational model to capture more meaning. ACM Transactions on Database Systems, 4(4). pp. 397-434

    The steps for finding a key are 

    1) Look for an industry standard and the trusted external source that maintains and verifies it. I count this as a natural key, but you could argue that it is artificial. 

    2) Look for a natural key in the attributes. Example: (longitude, latitude) makes a good key for a geographical location. A GPS can be used to verify it. 

    3) If you must design a new identifier, plan it carefully -- especially if people will see and use it. You have to be able to verify it in application programs, so you should have a regular expression, other syntax rules, check digits. You have to be able to be verify in the reality of the model or with a trusted source that you maintain. 

    Validation means the format is good -- "This could one of our invoice numbers because it is 7 digits long, passes a Bull code check digit and begins with  { '01', '02', '07', '99'}" 

    Verification means that it references a real entity -- "This is a real invoice because I can look it up in Accounts Payable and trace its approval back to Cindy Lu Who on 2005-02-12." 

    This is hard work. I have a few chapters in SQL PROGRAMMING STYLE on scales, measurements and how to design encoding schemes. 

     

    --CELKO-- Books in Celko Series for Morgan-Kaufmann Publishing: Analytics and OLAP in SQL / Data and Databases: Concepts in Practice Data / Measurements and Standards in SQL SQL for Smarties / SQL Programming Style / SQL Puzzles and Answers / Thinking in Sets / Trees and Hierarchies in SQL

    • Proposed as answer by Naomi N Sunday, November 25, 2012 3:40 PM
    Sunday, November 25, 2012 1:06 PM
  • Thanks again for all the input everyone.

    Kalman, I attempted to mark one of your posts as an answer as well and got an "Unexpected Error" popup. Perhaps there is an issue on the MSDN DB :). If you know how to resolve the popup issue so I can mark your post as an answer, let me know. I'll mark your post as an answer in addition to Tom's.

    Sunday, November 25, 2012 10:53 PM
  • It's a bug very unfortunate. See this thread http://social.msdn.microsoft.com/Forums/en-US/reportabug/thread/63e3f4b0-fc6f-4b4f-bf7a-b138dfcc89b5

    You may post in that thread with your case. I am also often unable to mark Kalman's answers.

    I also got this error trying to reply to you now.



    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog

    Monday, November 26, 2012 12:28 AM