Primary Key Type for Multi User Database

  • Question

  • This may fall under the heading of "duh" but I could really use confirmation of my design. I'm building a client app for multiple users, each of whom will collect data in a local database and at various times upload that data to an Azure database. I'm using Entity Framework, and the local database keys are being generated by the database. I'd like to use an integer key rather than a UID for better performance. My question concerns key duplication on the Azure database. I thought that I would have to use UIDs to eliminate conflicts. However, I just realized that I could use auto-generated keys on the Azure database during the upload, eliminating the possibility of a duplicate key while still using integer keys. I never have to download previously uploaded records, so that would not be an issue. My one concern is an error during upload, because the newly generated keys could then no longer be synchronized with the client keys. Any advice or confirmation of my thinking would be appreciated.


    Donald Hofferber

    Tuesday, December 4, 2012 11:12 PM


All replies

  • You could always use a 2-part key, comprising Computer_ID combined with an Identity column from the respective client machine. Not ideal, but much shorter than a UID.
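
    For illustration, a rough T-SQL sketch of that two-part key (the table and column names are placeholders, and TINYINT assumes no more than 255 client machines):

    CREATE TABLE dbo.ClientData
    (
        Computer_ID TINYINT NOT NULL,             -- assigned to each client machine
        Row_ID      INT IDENTITY(1,1) NOT NULL,   -- generated locally by the client database
        -- ... data columns ...
        CONSTRAINT PK_ClientData PRIMARY KEY (Computer_ID, Row_ID)
    )

    The same pair of columns then identifies the row unambiguously after it is uploaded to the central database.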

    "Premature optimization is the root of all evil." - Knuth

    If I provoked thought, please click the green arrow

    If I provoked Aha! please click Propose as Answer

    Tuesday, December 4, 2012 11:24 PM
  • That would work, but I'd like to stay with auto-generated keys, and I believe the only options for those are UID or integer.

    Donald Hofferber

    Tuesday, December 4, 2012 11:57 PM
  • You can place a unique index on the UID column and make an INT IDENTITY(1,1) column the PRIMARY KEY.

    Kalman Toth SQL 2008 GRAND SLAM
    New Book: Beginner Database Design & SQL Programming Using Microsoft SQL Server 2012

    Wednesday, December 5, 2012 1:25 AM
  • Could you please clarify with a little more detail?  Perhaps a small example.  Thanks.

    Donald Hofferber

    Wednesday, December 5, 2012 1:54 AM
  • UID means GUID? You are correct that using GUIDs hurts performance, so in that case, as others suggested, you can use an IDENTITY property and create a PK on that column:

    CREATE TABLE yourtable
    (
        ID INT NOT NULL IDENTITY(1,1) PRIMARY KEY,
        col1 ...,
        col2 ...
        ...
    )
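
    If you also want a GUID on each row, as Kalman suggested, one way (just a sketch, with placeholder names) is to add a uniqueidentifier column and put the unique index on it, while keeping the integer IDENTITY as the primary key:

    ALTER TABLE yourtable
        ADD UID UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID()  -- GUID kept alongside the integer key

    CREATE UNIQUE INDEX UX_yourtable_UID ON yourtable (UID)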


    Best Regards, Uri Dimant SQL Server MVP, http://sqlblog.com/blogs/uri_dimant/

    MS SQL optimization: MS SQL Development and Optimization
    MS SQL Blog: Large scale of database and data cleansing
    Remote DBA Services: Improves MS SQL Database Performance

    Wednesday, December 5, 2012 6:03 AM
  • 1. Does data originate in the local database only - and the Azure DB merely collects the local data in a central location?  Or does data also get created in the Azure database? 

    2.  Clarify "... use an integer key rather than UID for better performance".  Why do you believe this?  And on what basis are you evaluating performance?  Do you have a "performance" problem or are you attempting to avoid one? 

    3.  You never have to download previously uploaded records - but do you have to update them in the Azure DB from the local DB in some manner (and by "update", also include "delete")?

    Wednesday, December 5, 2012 2:51 PM
  • Thanks all for the discussion.  I have a question on the Identity key and then I'll answer Scott's questions.

    If I use an identity column for the key, wouldn't I have the same potential duplicate key issue as using an integer key?

    1.  Data only originates in the local databases.  There are a number of these clients.

    2.  Posts on the subject indicate that GUID keys are about 500 times slower to use than integer keys.  These numbers assume a large database.  My client database should not ever be larger than 5,000 records and the Azure database not larger than 500,000.  Perhaps the impact will be negligible at these sizes?

    3.  Excellent point.  Yes, they may be updated or deleted from the client.


    Donald Hofferber

    Wednesday, December 5, 2012 4:41 PM
  • In answer to your question: Yes, the same issue arises.

    Since key values are being generated at multiple sites, you somehow need to partition the key generation. This means that you have, de facto, a two-part key. Forcing this into a single column of your table means you have a repeating group, and thus are, strictly speaking, failing 0th Normal Form. Yes, I know this is worked around all the time, but such shenanigans are always workarounds to a bad design decision. Besides, which is better:

    1. Working with a 1+4-byte composite PK as I suggested above;
    2. Using a 16-byte UID that gives high confidence of uniqueness BUT NO GUARANTEE; or
    3. Fiddling with starting seed values for the various Identity columns (sketched below), and worrying about documenting it so a sleep-deprived operator can fix it up at 4:00 AM some Saturday morning.
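
    To make option 3 concrete - purely an illustration of the bookkeeping involved, not a recommendation; the 1,000,000-row range per client is an arbitrary assumption:

    -- Client 1's local database: keys 1,000,001 and up
    CREATE TABLE dbo.Orders (OrderID INT IDENTITY(1000001,1) NOT NULL PRIMARY KEY /* ... data columns ... */)

    -- Client 2's local database: keys 2,000,001 and up, documented somewhere an operator can find them
    CREATE TABLE dbo.Orders (OrderID INT IDENTITY(2000001,1) NOT NULL PRIMARY KEY /* ... data columns ... */)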

    "Premature optimization is the root of all evil." - Knuth

    If I provoked thought, please click the green arrow

    If I provoked Aha! please click Propose as Answer

    • Marked as answer by DDHSolutions Wednesday, December 5, 2012 11:38 PM
    Wednesday, December 5, 2012 5:27 PM
  • Pieter makes some good points about keys.  I'll add that I think you are both over-anticipating issues and over-assuming. 

    Once again, "500 times slower" refers to what exactly?  That's a rhetorical question and you need to consider those posts in light of whatever aspect they were measuring (correctly, or just as possibly, incorrectly).  Also factor in the possibility that those posts just might be wrong.  A high point number next to someone's name does not denote infallibility.  Given your sizes, the performance impact of using server-generated GUIDs (which would only occur in the client databases) is likely to be negligible.  You didn't tell us the version of SQL Server you are using - which is something you should always include - but the newer versions of SQL Server have options to overcome the "randomness" of GUID values if that is a concern. 
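
    That option is presumably NEWSEQUENTIALID(), available since SQL Server 2005, which generates roughly ordered GUIDs but can only be used as a column DEFAULT. A minimal sketch, with placeholder names:

    CREATE TABLE dbo.LocalData
    (
        RowGuid UNIQUEIDENTIFIER NOT NULL
                CONSTRAINT DF_LocalData_RowGuid DEFAULT NEWSEQUENTIALID()
                CONSTRAINT PK_LocalData PRIMARY KEY
        -- ... data columns ...
    )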

    Now that it is apparent that your import process includes reconciliation, you need to be able to associate rows in the central db with the source db.  Deletions can be a troublesome issue.  For instance, how do you know (after the fact) that row x was deleted from the local db?  Perhaps it is time to step back and re-evaluate your approach.  Do you anticipate that the focus of data gathering/modification is the local DB and the central DB simply serves as a collection mechanism?  Is the central DB a required part of your system?  Or merely a bonus feature?  Is the local DB/application operable as a standalone system?  There are no "right" answers here - but these questions should at least make you aware of how you intend to operate your system and therefore affect your design.  If the central DB is a critical part of the design, you might - as others have indicated - use it to manage a set of keys - one of which is assigned to a particular local db (in some manner).  That approach requires the central DB and makes it integral to the design.  Hopefully, you see how this leads you back to my previous thought - how do you intend for the system to be used?  As I see it, that decision is critical and needs to be made before you start worrying about primary keys. 
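
    One common way to keep that association (just a sketch - ClientID and SourceRowID are illustrative names) is to carry the originating local key alongside the central table's own key:

    CREATE TABLE dbo.CentralData
    (
        CentralID   INT IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- generated by the Azure database
        ClientID    INT NOT NULL,                            -- identifies the uploading local database
        SourceRowID INT NOT NULL,                            -- the row's key in that local database
        -- ... data columns ...
        CONSTRAINT UQ_CentralData_Source UNIQUE (ClientID, SourceRowID)
    )

    The upload can then match on (ClientID, SourceRowID) to apply updates and deletions coming from a given client.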

    You've been given some very specific recommendations.  I suggest you evaluate those against your "vision" of the system and how it works as a whole.  The problem with asking such a general question is that the responses will be based on assumptions and past experience.  These may be very different from reality.  

    One last point - no matter what approach you choose, you should anticipate key collisions and have a way to resolve them in the system.  Somewhere, sometime, somehow - it will happen.  Usually at the worst possible time.

    • Marked as answer by DDHSolutions Wednesday, December 5, 2012 11:38 PM
    Wednesday, December 5, 2012 7:19 PM
  • I've marked Pieter's reply as an answer because it would work. However, Scott's in-depth response and guidance is what I will follow. The central DB is just a warehouse to store data to be pulled down to their main facility for various forms of reporting and publication. Using Azure for this central DB is their architecture decision and they are sticking with it. My job is to make it work. They will also do program updates via Azure. I agree that using GUID keys will probably have an insignificant impact on performance, and they are the simplest to implement, so that's how I intend to proceed. One last thing you mentioned is that I should code for possible duplicate keys even if I use GUIDs. I would always code a try/catch around such data transfers, but would you really expect a duplicate?

    Thanks all again for a great discussion.


    Donald Hofferber

    Wednesday, December 5, 2012 11:52 PM
  • From Stack Overflow (http://stackoverflow.com/questions/3038023/uuid-collision-risk-using-different-algorithms):

    "Now, keep in mind that when you use an algorithm like Guid.Comb, you only have 10 bits of uniqueifier, which equates to 1024 separate values.  So if you're generating a huge number of GUIDs within the same few milliseconds, you will get collisions.  But if you generate GUIDs at a fairly low frequency, it doesn't really matter how many different algorithms you use at the same time, the likelihood of a collision is still practically nonexistent."

    [edit] The chance of a collision reaches roughly 50% once you have generated about the square root of the number of possible values - about 32 in this case. So every time roughly 30 GUIDs are generated within 1 ms there is on the order of a 50% chance of a collision. That's hardly remote for a busy server.
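
    As a quick check of that figure, using the standard birthday approximation over the 1024-value codomain implied by the 10-bit uniqueifier quoted above:

    p ≈ 1 - e^(-n(n-1)/(2*1024))
    n = 32  ->  p ≈ 0.38
    n = 38  ->  p ≈ 0.50

    So "about 32" is the right order of magnitude; roughly 38 GUIDs in the same millisecond is where the odds actually cross 50%.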


    "Premature optimization is the root of all evil." - Knuth

    If I provoked thought, please click the green arrow

    If I provoked Aha! please click Propose as Answer

    We are here to learn, to share knowledge, and to earn points; all in about equal measure.


    Thursday, December 6, 2012 3:17 AM
  • Follow-up:

    GUIDs/UIDs behave like one more type of hash value, and with their use increasing SOMEONE will eventually encounter a collision, just as SOMEONE will win the lottery. The question to ask is: what business effect occurs as the result of a collision? If it will bring your business down, then you have to code for it, no matter how unlikely. If it is just a nuisance requiring a couple of hours' work by a hung-over operator early some Saturday morning, then just hire good operators.

    This is true for all possible though unlikely circumstances: if the occurrence will be catastrophic, then you must code for it; if the occurrence will not be catastrophic, make a judgement call. Good engineering is about making sound judgement calls from adequate research.


    "Premature optimization is the root of all evil." - Knuth

    If I provoked thought, please click the green arrow

    If I provoked Aha! please click Propose as Answer

    We are here to learn, to share knowledge, and to earn points; all in about equal measure.

    Thursday, December 6, 2012 3:23 AM
  • Pieter - Thanks for the sound advice.

    Donald Hofferber

    Thursday, December 6, 2012 8:01 PM