none
Primary Keys - int's vs. varchar

    Question

  • Hello all,

    I am embarking on a development project with a friend of mine, involving SQL2005. While both of us have experience in database development, we have both had this in different environments. He has experience with access, while I have experience with a proprietary database (I wont bother naming it).

    As we have these different backgrounds, we are having a few conflicts as to which way we should do various things - one of these areas being primary keys.
    My friend, coming from an access background, says that in access, the primary key should be an int (autonumber), which is usually hidden from the user, while you show a string field to the user, pretending it is the primary key (So for a customer table, the user will see "CUST0001" as the primary key, but the primary key is actually a hidden autonumber field). This is done as looking up by integer is faster.
    I, on the other hand, come from a database environment where primary keys are almost always a string (an uppercase string).

    Naturally, we both want to do this the way we are used to. But, which is the best way? Are the performance issues of int vs varchar that great in SQL 2005?

    Thanks.
    • Moved by VMazur Wednesday, July 08, 2009 10:37 AM (From:ADO.NET Data Providers)
    Wednesday, July 11, 2007 3:02 AM

Answers

  • If you're designing a SQL Server database for 3-10 users and your tables have 5000 rows in them, worrying about the relative efficienty of different datatypes for your primary keys falls into the category of premature optimization.  In other words, no, you're not going to find significant performance differences.
    Thursday, July 12, 2007 2:09 PM
  • Typically, a question like this triggers discussions on how to use the int or varchar. In other words, discussion on using natural keys or surrogate keys, and on whether Identity is a good choice of surrogate key.

    But when it comes to performance, it is all much simpler and less dependent on opinion. Assuming you never update your Primary Key values, then these are the relevant factors:

    1. int always requires exactly 4 bytes for storage. Varchar requires 2 plus the the size of the stored value. Also, there is storage overhead for the first varchar in the table. Since a primary key of only 2 letters isn't useful for big tables (tables for which the performance is important), you could say that varchar will always use more storage, which is 'slower'. Note that the clustered index key is included in all nonclustered indexes. So if the Primary Key is enforced with the clustered index, then a wide primary key will also make all nonclustered indexes more expensive.

    2. Comparing two int values is very efficient. Comparing two varchar values with default collations is much more expensive, because of case insensitivity and accent insensitivity. So under normal circumstances scanning large amount of varchar values will be slower than scanning large amounts of int

    If you try hard, and want character based keys instead of integers, then these two definitions will perform about equally.

    my_integer_key int
    my_character_key char(4) COLLATE Latin1_General_BIN
    

    Note that working with a character string with nonstandard collation means you will have to match any local variable's or parameter's declaration that has to hold a primary key value. So using int makes for easier coding.

    Having said all this, I agree with the other poster that said that the performance of this is not a concern unless the table grows really large, or if your varchar values are very long. For small and medium sized tables, ease of use and policies on key choice are more important.


    Gert-Jan

    Thursday, February 07, 2013 9:41 PM

All replies

  • You need to ask the customer who will pay for the application because big companies with big projects will not pay for auto number which in SQL Server is IDENTITY, they prefer natural key which is the Varchar.  But it really depends on the customer and their preference and needs.  Hope this helps.
    Wednesday, July 11, 2007 3:29 AM
  • Thanks for your answer,

    This development isnt for a specific client - its something that we hope to sell. Our clients will therefore be largely unaware of the underlying data structure. We are trying to get a feel for what is the best trade off between performance and maintainability.

    Can you tell me why these companies prefer varchar over ints (identity)?
    Wednesday, July 11, 2007 3:35 AM
  • All three of you are missing the point of primary keys altogether. Sorry, for the boldness, but it seems to be required.

     

    Primary keys can be any datatype. Their sole purpose is to identify uniqueness in the records. Coincidently, primary keys have indexes that enhance performance, however, and index can be set on any unique column that isn't a primary or foreign key.

     

    There is no argument on either side whether or not a varchar or int is more or less sufficient. It's very possible that both might be concatentated as the primary key.

     

    Relational database design is universal. The environment is no factor, although the approach does change based on the forgiveness of the query optimizer.

     

    As an aside, unless required, I personally go to great lengths to avoid IDENTITY (autonumber primary key) fields. I would go to the extreme of using every field as the primary key if that made the uniqueness. I would even use no primary if duplicates are acceptable.

     

    The point is, if you don't "need" a primary key, don't abuse autonumber-identity fields if they're not necessary.

     

    Furthermore, if you ever only INSERT and SELECT from the table and you can't find a uniqueness, don't use a primary key. It is only when you begin to UPDATE and DELETE that the focus on the primary key is necessary.

     

    In conclusion, I can't really cover the exceptions in a small thread but I wouldn't be opposed to entertain and follow up requests.

     

    Adamus

     

     

    Wednesday, July 11, 2007 3:48 AM
  • If your database is of any size, and you are doing a great deal of queries/updates, you might want to revisit the surrogage key (int/guid) approach.
    Wednesday, July 11, 2007 6:10 AM
  •  ericrtodd wrote:
    If your database is of any size, and you are doing a great deal of queries/updates, you might want to revisit the surrogage key (int/guid) approach.

     

    Eric, although it's not good practice to create varchar() primary keys for the sake of primary keys (when possible and for performance reasons), it can be necessary to create the uniqueness.

     

    For example, if you had a table of cities and states, both fields would be varchar and the city would be the primary key. In this specific situation, you would not create an identity field simply to have an integer primary key. It just wouldn't make sense.

     

    Yes, this is a small scale example. But no, even on a large scale it is not discouraged or unpracticed to use even varchar(50) as a primary. It is a necessary evil outside of redesigning the table structure which most often is not an option in a long standing production environment.

     

    Just my twist on it,

     

    Adamus

    Wednesday, July 11, 2007 6:21 AM
  • Well, typically we are talking about master file information, like customers, suppliers, stock - these are most likely to number in the hundreds or thousands (possibly tens of thousands - very unlikely to be hundreds of thousands).
    Wednesday, July 11, 2007 8:03 AM
  • > Primary keys can be any datatype. Their sole purpose is to identify uniqueness
    > in the records.


    Not so.  Their sole purpose is to uniquely identify the record.  This is different from identifying what is unique in the record.

    > For example, if you had a table of cities and states, both fields would be varchar
    > and the city would be the primary key. In this specific situation, you would not create
    > an identity field simply to have an integer primary key. It just wouldn't make sense.


    This is completely incorrect, for reasons that I'll get to in a moment.

    > Furthermore, if you ever only INSERT and SELECT from the table and you can't find
    > a uniqueness, don't use a primary key. It is only when you begin to UPDATE and
    > DELETE that the focus on the primary key is necessary.

    You left out "relational integrity."  In fact, everyone posting to this thread seems to have left that out.

    The great strength of synthetic keys in normalized databases is that they allow data in any column to change independently of any relationships that the row participates in.  To use the above example of cities:  yes, a combination of state code and city name is sufficient to uniquely identify a city, but it introduces tremendous inefficiencies if other tables have to join to that table, especially  in the not-unlikely case that a city name is found to have been misspelled after the database is populated with thousands of rows referencing it.

    The sole cost to using an int IDENTITY column as the PK to a table is that you can't know what the PK of a row will be before you insert it.  But the benefits of using it are overwhelming:
    • Identity PKs are absolutely invariant.  It may be possible to assert that a given non-identity PK is invariant, but often that assertion is incorrect.  For instance:  I do a lot of work with court systems.  One of the default rules of courts is that case numbers never change.  So every case-management I've ever seen uses the case number as the PK to the case table, and all other tables join to the case table by the case number.  Every case-management application I've ever seen also includes a utility to change the case number on a case, because it turns out that case numbers do change.
    • Every table uses the same identity method.  There is no need to ask "how is the PK of this table formulated?" when writing a query that JOINs to it, or when designing an association table.  You can write "InvoiceID int NOT NULL FOREIGN KEY FK_ThisTable_InvoiceID REFERENCES Invoice (InvoiceID)" without having to ask yourself "How many characters is InvoiceNum again?"  You can change the length of InvoiceNum without affecting any of the tables that join to Invoice.
    • You never replicate data outside of the table that contains it.  If you have a master table containing the names of all the attorneys in the country, you don't have to include any data except the attorney row's unique identifier in any other table that joins to it.
    • While they're absolutely synthetic, int IDENTITY columns still provide one very useful piece of information:  the order in which rows were inserted into the table.  Other PK schemes do not.
    There are certainly circumstances in which an int IDENTITY PK isn't the right answer.  (They're not sufficient to support replication scenarios, for instance.)  But most of the time it is.
    Wednesday, July 11, 2007 8:54 AM
  • Robert you make some very good points on the exceptions but you are negating the rules. When you have gained more experience working with relational databases, please revisit this thread as it may make more sense.

     

    In my experience, I have found the abuse of PK's solely based on not knowing the correct or better way to choose.

     

    For example, some use IDENTITY columns to determine chronology. This is bad practice. Timestamps are the correct choice.

     

    The rule, not the exceptions, is that IDENTITY columns are used for matching single records for updates and deletes (only IF a PK doesn't exist). Outside of that, there is no good reason to have one.

     

    If you have any further questions, please don't hesitate to ask. I'm always here to help.

     

    Adamus

     

     

    Wednesday, July 11, 2007 1:04 PM
  • While I appreciate all of these answers, but they arnt quite answering the original question (maybe I worded it badly).

     

    What I want to know is, in MS SQL 2005, is the performance difference between using an int and a varchar as a primary key significant? Apparently it is in Access, which is why autonumber is used as a primary key so much. I'm not talking a huge database with hundreds of users - I'm talking a small database, probably around 3-10 users, and tables that are likely to have up to 5,000 records or so (customer/vendor/stock master files).

     

    Thanks.

    Wednesday, July 11, 2007 9:28 PM
  •  fweeee wrote:

    While I appreciate all of these answers, but they arnt quite answering the original question (maybe I worded it badly).

     

    What I want to know is, in MS SQL 2005, is the performance difference between using an int and a varchar as a primary key significant? Apparently it is in Access, which is why autonumber is used as a primary key so much. I'm not talking a huge database with hundreds of users - I'm talking a small database, probably around 3-10 users, and tables that are likely to have up to 5,000 records or so (customer/vendor/stock master files).

     

    Thanks.

     

    I think you need to see the data design of similar applications and add flexible changes with user needs in mind, so your customers can add and remove tables without issues with your original application.  You cannot compare Access to SQL Server, I use both SQL Server and Oracle all the time but I don't use Access.  What I am saying  is if you take time to design your database in SQL Server you will not run into serious performance issues because there are a lot of things you can do in Access that you cannot do in SQL Server.  The link below are many ready to use data models you can modify and use as needed.  Hope this helps.

     

    http://www.databaseanswers.org/data_models/index.htm

    Wednesday, July 11, 2007 9:51 PM
  • If you're designing a SQL Server database for 3-10 users and your tables have 5000 rows in them, worrying about the relative efficienty of different datatypes for your primary keys falls into the category of premature optimization.  In other words, no, you're not going to find significant performance differences.
    Thursday, July 12, 2007 2:09 PM
  • I know this is a very old thread, but I have an identical question.  I wonder if fweeee is still active and would like to update us with what conclusion you came to in your question of whether to use INT or VARCHAR for PK columns?

    My situation is very similar: designing an SQL db for potentially large web app.  PK options for me would be like this:

    sys_id int
    user_id int

    (example values 10, 1001)

    or like this

    user_id varchar(50)

    (example value 10-1001)

    I would most likely not use IDENTITY property for the INT option, as that one drawback of IDENTITY being that you can't know what the PK is until after the row has been inserted into the database.  Either way we will most likely write a web server component that manages numeric key values on each node of the web server.  Either way the unique key would be maintained as a concatenation of the system_id (numeric code of the webserver node) and an incremental numeric value of identifying the record on that node.

    The question then becomes purely academic around which data type provides faster joins .....  either joining on a single varchar(50) column made up of 9999-9999999 or joining on 2 INT (or BIGINT) columns made up of 9999 and 9999999.

    Preliminary tests are inconclusive, with the varchar(50) seeming to win by a tiny margin.

    Any other ideas or arguments?

    thanks!!

    Tuesday, July 07, 2009 6:20 PM
  • Well this is an old thread - I just stumbled onto it again (I'm thinking that SQLforGirls has solved his/her problem by now.

    I ended up going for ints (or bigints) for my primary keys, setting them to identity (so I didnt have to worry about generating them).

    Yes, the fact that I dont know the primary key of a field before it is inserted is a bit annoying (especially when entering data for something with a parent/child relationship, while allowing the user to change their mind and undo it all by pressing escape). But ultimatly, I went with ints for the following reasons:
    1) Slightly faster (although my database is not so big that our users will actually notice).
    2) Users can easiliy change the name/code without it also having to update any children records

    There might have been some other reasons, but I forgot - I made this decision so long ago.
    Monday, September 21, 2009 12:40 PM
  • I read all answers, but none answer the simple question, what should choose int or vchar!!!

    I find the following link that explain about the overhead for PK:

    http://www.sqlskills.com/blogs/kimberly/guids-as-primary-keys-andor-the-clustering-key/

    Wednesday, February 06, 2013 12:45 PM
  • >I read all answers, but none answer the simple question, what should choose int or vchar!!!

    Choose INT always over varchar.

    Proof: EmailAddress (varchar(70)) takes more space and  many times slower than EmailID (INT) in JOINs. Not to mention if EmailAddress ( e.g. madonna@superhotcelebrities.com) used as a FOREIGN KEY, PK table content is duplicated in each FK table column (3NF rule violation in effect).


    Kalman Toth Database & OLAP Architect
    Paperback / Kindle: SQL Server 2012 Pro


    Thursday, February 07, 2013 6:36 PM
    Moderator
  • It is a very long discussion in regards to surrogate vs. natural keys. I think my last reference in this quiz is really good summary of this topic

    http://beyondrelational.com/quiz/sqlserver/general/2010/questions/sqlserver-quiz-general-2010-madhu-k-nair-surrogate-key-vs-natural-key.aspx?pg=3


    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog

    Thursday, February 07, 2013 6:42 PM
    Moderator
  • Typically, a question like this triggers discussions on how to use the int or varchar. In other words, discussion on using natural keys or surrogate keys, and on whether Identity is a good choice of surrogate key.

    But when it comes to performance, it is all much simpler and less dependent on opinion. Assuming you never update your Primary Key values, then these are the relevant factors:

    1. int always requires exactly 4 bytes for storage. Varchar requires 2 plus the the size of the stored value. Also, there is storage overhead for the first varchar in the table. Since a primary key of only 2 letters isn't useful for big tables (tables for which the performance is important), you could say that varchar will always use more storage, which is 'slower'. Note that the clustered index key is included in all nonclustered indexes. So if the Primary Key is enforced with the clustered index, then a wide primary key will also make all nonclustered indexes more expensive.

    2. Comparing two int values is very efficient. Comparing two varchar values with default collations is much more expensive, because of case insensitivity and accent insensitivity. So under normal circumstances scanning large amount of varchar values will be slower than scanning large amounts of int

    If you try hard, and want character based keys instead of integers, then these two definitions will perform about equally.

    my_integer_key int
    my_character_key char(4) COLLATE Latin1_General_BIN
    

    Note that working with a character string with nonstandard collation means you will have to match any local variable's or parameter's declaration that has to hold a primary key value. So using int makes for easier coding.

    Having said all this, I agree with the other poster that said that the performance of this is not a concern unless the table grows really large, or if your varchar values are very long. For small and medium sized tables, ease of use and policies on key choice are more important.


    Gert-Jan

    Thursday, February 07, 2013 9:41 PM
  • This practical script extremely explains why primary key of both string and date field is better than a dummy identity field:

    --IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[CharReceipt]') AND type in (N'U'))
    --DROP TABLE [CharReceipt]
    --GO

    --IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[CharReceiptDetails]') AND type in (N'U'))
    --DROP TABLE [CharReceiptDetails]
    --GO

    --IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[IntReceipt]') AND type in (N'U'))
    --DROP TABLE [IntReceipt]
    --GO

    --IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[IntReceiptDetails]') AND type in (N'U'))
    --DROP TABLE [IntReceiptDetails]
    --GO

    SET ANSI_NULLS ON
    GO

    SET QUOTED_IDENTIFIER ON
    GO

    IF NOT EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[CharReceipt]') AND type in (N'U'))
    BEGIN
    CREATE TABLE [CharReceipt](
        [ReceiptNo] [nvarchar](20) NOT NULL,
        [ReceiptDate] [datetime] NULL,
        [Remarks] [nvarchar](max) NULL,
     CONSTRAINT [PK_CharReceipt] PRIMARY KEY CLUSTERED
    (
        [ReceiptNo] ASC
    )WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
    ) ON [PRIMARY]
    END
    GO

    SET ANSI_NULLS ON
    GO

    SET QUOTED_IDENTIFIER ON
    GO

    IF NOT EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[CharReceiptDetails]') AND type in (N'U'))
    BEGIN
    CREATE TABLE [CharReceiptDetails](
        [ReceiptNo] [nvarchar](20) NOT NULL,
        [AccountName] [nvarchar](50) NOT NULL,
        [Value] [decimal](18, 5) NULL,
        [Remarks] [nvarchar](max) NULL,
        [ReceiptDate] [datetime] NULL,
     CONSTRAINT [PK_CharReceiptDetails] PRIMARY KEY CLUSTERED
    (
        [ReceiptNo] ASC,
        [AccountName] ASC
    )WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
    ) ON [PRIMARY]
    END
    GO

    SET ANSI_NULLS ON
    GO

    SET QUOTED_IDENTIFIER ON
    GO

    IF NOT EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[IntReceipt]') AND type in (N'U'))
    BEGIN
    CREATE TABLE [IntReceipt](
        [ID] [int] IDENTITY(1,1) NOT NULL,
        [ReceiptNo] [nvarchar](20) NOT NULL,
        [ReceiptDate] [datetime] NULL,
        [Remarks] [nvarchar](max) NULL,
     CONSTRAINT [PK_IntReceipt] PRIMARY KEY CLUSTERED
    (
        [ID] ASC
    )WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
    ) ON [PRIMARY]
    END
    GO

    SET ANSI_NULLS ON
    GO

    SET QUOTED_IDENTIFIER ON
    GO

    IF NOT EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[IntReceiptDetails]') AND type in (N'U'))
    BEGIN
    CREATE TABLE [IntReceiptDetails](
        [ID] [int] IDENTITY(1,1) NOT NULL,
        [ReceiptID] [int] NOT NULL,
        [AccountName] [nvarchar](50) NOT NULL,
        [Value] [decimal](18, 5) NULL,
        [Remarks] [nvarchar](max) NULL,
     CONSTRAINT [PK_IntReceiptDetails] PRIMARY KEY CLUSTERED
    (
        [ID] ASC
    )WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
    ) ON [PRIMARY]
    END
    GO


    declare @i int, @BeginDate datetime, @CurrentDate datetime, @ReceiptNo nvarchar(20), @IntIdentity int, @BeginExecution datetime, @EndExecution datetime,
     @CharAndDateQueryTimeInMS int, @SecondQueryTimeInMS int, @CharQueryTimeWithOneField int
    set @i = 1
    set @BeginDate = GETDATE()
    while(@i < 500000)
    begin
        set @ReceiptNo = RIGHT('0000000000' + CAST(@i as nvarchar(20)), 10)
        set @CurrentDate = DATEADD(SECOND,@i,@BeginDate)
        if(not Exists(select top 1 ReciptNo from CharReceipt where ReciptNo = @ReceiptNo))
        begin
            insert into CharReceipt values(@ReceiptNo, @CurrentDate, null)
            insert into CharReceiptDetails values(@ReceiptNo, 'Account One' , 50, null, @CurrentDate)
            insert into CharReceiptDetails values(@ReceiptNo, 'Account Two' , 60, null, @CurrentDate)
            insert into CharReceiptDetails values(@ReceiptNo, 'Account Three' , 10, null, @CurrentDate)
        end
        if(not Exists(select top 1 ReceiptNo from IntReceipt where ReceiptNo = @ReceiptNo))
        begin
            insert into IntReceipt values(@ReceiptNo, @CurrentDate, null)
            set @IntIdentity = @@IDENTITY
            insert into IntReceiptDetails values(@IntIdentity, 'Account One' , 50, null)
            insert into IntReceiptDetails values(@IntIdentity, 'Account Two' , 60, null)
            insert into IntReceiptDetails values(@IntIdentity, 'Account Three' , 10, null)
        end
        set @i = @i + 1
    end

    set @BeginExecution = GETDATE()
    SELECT  CharReceipt.ReceiptNo, CharReceipt.ReceiptDate, CharReceipt.Remarks, CharReceiptDetails.AccountName, CharReceiptDetails.Value,
            CharReceiptDetails.Remarks AS DetailsRemarks
    FROM    CharReceipt INNER JOIN
            CharReceiptDetails ON CharReceipt.ReceiptNo = CharReceiptDetails.ReceiptNo AND CharReceipt.ReceiptDate = CharReceiptDetails.ReceiptDate
    set @EndExecution = GETDATE()
    set @CharAndDateQueryTimeInMS = DATEDIFF(MS, @BeginExecution, @EndExecution)
    set @BeginExecution = GETDATE()
    SELECT    IntReceipt.ReceiptNo, IntReceipt.ReceiptDate, IntReceipt.Remarks, IntReceiptDetails.AccountName, IntReceiptDetails.Value,
            IntReceiptDetails.Remarks AS DetailsRemarks
    FROM    IntReceipt INNER JOIN
            IntReceiptDetails ON IntReceipt.ID = IntReceiptDetails.ReceiptID
    set @EndExecution = GETDATE()
    set @SecondQueryTimeInMS = DATEDIFF(MS, @BeginExecution, @EndExecution)
    set @BeginExecution = GETDATE()
    SELECT  CharReceipt.ReceiptNo, CharReceipt.ReceiptDate, CharReceipt.Remarks, CharReceiptDetails.AccountName, CharReceiptDetails.Value,
            CharReceiptDetails.Remarks AS DetailsRemarks
    FROM    CharReceipt INNER JOIN
            CharReceiptDetails ON CharReceipt.ReceiptNo = CharReceiptDetails.ReceiptNo
    set @EndExecution = GETDATE()
    set @CharQueryTimeWithOneField = DATEDIFF(MS, @BeginExecution, @EndExecution)

    select @CharAndDateQueryTimeInMS as CharAndDateTableSelectInMS, @SecondQueryTimeInMS as IntTableSelectInMS, @CharQueryTimeWithOneField as OneFieldCharTableInMS

    Results

    CharAndDateTableSelectInMS    IntTableSelectInMS    OneFieldCharTableInMS
    39603                                        33516                         26890

    So by these results we conclude that a compiste key of a string and date fields are too bad in joining with other tables and an integer identity field is better than this solution but a string field only is better than both of them!!

    Friday, May 03, 2013 7:51 AM