none
Using varchar for primary key value ok?

    Question

  • I'm planning on storing product information in a table and I was going to use varchar for my datatype in my product_id column because all of the product ids aren't integers.  Is there any issues with this?

    Thanks

    Sunday, January 16, 2011 7:36 PM

Answers

  • Varchar takes up more space than integers, and there is more processing required to work with character data.

    Then again, database design should not be determined from performance alone. If the product ids are non-numeric, it could be alright to use it. But if you expect the transactional tables (Orders, Invoices, Whatever) to be big, it may be better to introduce a surrogate key and have the non-numeric key only in the Products table. This is particularly true if the id for a product could change.

    Another thing to watch for if you use varchar, and you don't use stored procedures. Some client environments assume Unicode by default, and will construct queries with Unicode data types or literals. If you have a query that goes:

    SELECT ... FROM products WHERE product_id = N'ABC123'

    The column will be converted to nvarchar, because of the data-type precedence rules in SQL Server. The exact impact depends on your collation, but it is definitely not good for performance.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
    Links for SQL Server Books Online: SQL 2008, SQL 2005 and  SQL 2000.
    (Just click the link you need.)
    • Proposed as answer by Naomi NModerator Sunday, January 16, 2011 9:32 PM
    • Marked as answer by KJian_ Monday, January 24, 2011 9:25 AM
    Sunday, January 16, 2011 7:57 PM
  • Hi Westside2008,

    As innocent as this question looks, it's far from simple. There have
    been holy wars fought out over this.

    Some people will prefer to add an identity column as surrogate key to
    every table, and then add a unique constraint to the product_id
    column. Other people will prefer to use the "real" key as the priimary
    key.

    Both approaches have advantages and disadvantages.

    Advantages of identity surrogate keys over real keys only:

    1. Number is always increasing as new rows are added - good for insert
    performance. With only real key, you get page splits as new data is
    inserted out of order. However, this holds only if the index used to
    support the primary key is clustered (the default); if you have
    reasons to choose another clustered index, this arguments becomes
    void.

    2. Never problems with key changing, even if the "real" key sometimes
    will change. Just change it in the table; all referencing tables use
    the identity value, that is still unchanged.
    3. Referencing columns are relatively small and joins are relatively
    fast if only 8-byte integers are used. Character columns, especially
    long strings, tend to be a bit slower.

    Disadvantages of identity over real keys only:

    1. Sometimes an extra join is required - for instance, if the
    product_id has to be included in a report but no other data from the
    products table, than a design with real key only will already have the
    product_id in the referencing table; if you use a surrogate key, you
    need to join to the products table. This makes queries more
    complicated and a bit slower.

    2. More data is required in the table itself, and more indexes are
    used. This takes more hard disk space. (Though you can also hard disk
    space if references are implemented using an integer instead of a
    longer column).

    3. People tend to forget to enforce the uniqueness of the real key; in
    that case, duplicates can enter the database and sit there for a long
    time undetected, causing situations that can be difficult to repair.

    4. If you expose surrogate key values to the user (which you actually
    should never do, according to the designers of the relational model),
    you may get questions - for instance about gaps in the identity
    sequence. Or customers might see from the key values that your
    business is very low.

    Note that numbers 3 and 4 are not actually problems with surrogate
    keys, but with improper use of them.

    My personal preference is to first create a logical design that does
    not use any surrogate keys. Then, I decide for each individual table
    whether or not to add a surrogate key. If a table is not referenced, I
    never add a suurrogate. If a table is referenced and has a compound
    key of three or more columns, I almost always add a surrogate key;
    also if the real key includes a string column (or more). The rest of
    the tables, I decide on a case by case basis; no hard and fast rules.


    Hugo Kornelis, SQL Server MVP
    My SQL Server blog: http://sqlblog.com/blogs/hugo_kornelis
    Sunday, January 16, 2011 8:06 PM
  • No, any data type can be a key. But most product ids, SKUs, etc. are industry standards (VIN for automobiles, ISBN for books, ISAN for audio-visual media, etc.) and they are fixed length with some kind of validation, like a regular expression or a check digit. 

    Newbies hate this part of RDBMS; you have to actually design data and not just code away. And it is hard work. Lazy bastards will jut throw a meaningless, machine generated GUID, IDENTITY or whatever on the row to mimic a file-system pointer and use it. They lose validation and verification, cannot port their data and have other problems. 

    Do some research and see what your industry uses. If you need to actually design your own encoding, I have a few chapters on how to do that in SQL PROGRAMING STYLE. 


    --CELKO-- Books in Celko Series for Morgan-Kaufmann Publishing: Analytics and OLAP in SQL / Data and Databases: Concepts in Practice Data / Measurements and Standards in SQL SQL for Smarties / SQL Programming Style / SQL Puzzles and Answers / Thinking in Sets / Trees and Hierarchies in SQL
    Monday, January 17, 2011 9:58 PM

All replies

  • In most cases, it's better to use "meaningless" sequence numbers (surrogate keys, see http://en.wikipedia.org/wiki/Surrogate_key) as primary key contents. Due to their missing meanings, it's never necessary to change them. And the content of a primary key column should never be changed, otherwise you have to change the foreign key content in neighbored tables, too.

    Regards,

    Klaus

    Sunday, January 16, 2011 7:49 PM
  • Varchar takes up more space than integers, and there is more processing required to work with character data.

    Then again, database design should not be determined from performance alone. If the product ids are non-numeric, it could be alright to use it. But if you expect the transactional tables (Orders, Invoices, Whatever) to be big, it may be better to introduce a surrogate key and have the non-numeric key only in the Products table. This is particularly true if the id for a product could change.

    Another thing to watch for if you use varchar, and you don't use stored procedures. Some client environments assume Unicode by default, and will construct queries with Unicode data types or literals. If you have a query that goes:

    SELECT ... FROM products WHERE product_id = N'ABC123'

    The column will be converted to nvarchar, because of the data-type precedence rules in SQL Server. The exact impact depends on your collation, but it is definitely not good for performance.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
    Links for SQL Server Books Online: SQL 2008, SQL 2005 and  SQL 2000.
    (Just click the link you need.)
    • Proposed as answer by Naomi NModerator Sunday, January 16, 2011 9:32 PM
    • Marked as answer by KJian_ Monday, January 24, 2011 9:25 AM
    Sunday, January 16, 2011 7:57 PM
  • It is perfectly acceptable to use a varchar column as the primary key.  This is often the case when one uses a natural key that doesn't happen to be an integer.  Keep in mind that even if you introduce a surrogate as the pimary key, you'll still need to create a unique constraint on product_id.

     


    Dan Guzman, SQL Server MVP, http://weblogs.sqlteam.com/dang/
    Sunday, January 16, 2011 8:02 PM
  • Hi Westside2008,

    As innocent as this question looks, it's far from simple. There have
    been holy wars fought out over this.

    Some people will prefer to add an identity column as surrogate key to
    every table, and then add a unique constraint to the product_id
    column. Other people will prefer to use the "real" key as the priimary
    key.

    Both approaches have advantages and disadvantages.

    Advantages of identity surrogate keys over real keys only:

    1. Number is always increasing as new rows are added - good for insert
    performance. With only real key, you get page splits as new data is
    inserted out of order. However, this holds only if the index used to
    support the primary key is clustered (the default); if you have
    reasons to choose another clustered index, this arguments becomes
    void.

    2. Never problems with key changing, even if the "real" key sometimes
    will change. Just change it in the table; all referencing tables use
    the identity value, that is still unchanged.
    3. Referencing columns are relatively small and joins are relatively
    fast if only 8-byte integers are used. Character columns, especially
    long strings, tend to be a bit slower.

    Disadvantages of identity over real keys only:

    1. Sometimes an extra join is required - for instance, if the
    product_id has to be included in a report but no other data from the
    products table, than a design with real key only will already have the
    product_id in the referencing table; if you use a surrogate key, you
    need to join to the products table. This makes queries more
    complicated and a bit slower.

    2. More data is required in the table itself, and more indexes are
    used. This takes more hard disk space. (Though you can also hard disk
    space if references are implemented using an integer instead of a
    longer column).

    3. People tend to forget to enforce the uniqueness of the real key; in
    that case, duplicates can enter the database and sit there for a long
    time undetected, causing situations that can be difficult to repair.

    4. If you expose surrogate key values to the user (which you actually
    should never do, according to the designers of the relational model),
    you may get questions - for instance about gaps in the identity
    sequence. Or customers might see from the key values that your
    business is very low.

    Note that numbers 3 and 4 are not actually problems with surrogate
    keys, but with improper use of them.

    My personal preference is to first create a logical design that does
    not use any surrogate keys. Then, I decide for each individual table
    whether or not to add a surrogate key. If a table is not referenced, I
    never add a suurrogate. If a table is referenced and has a compound
    key of three or more columns, I almost always add a surrogate key;
    also if the real key includes a string column (or more). The rest of
    the tables, I decide on a case by case basis; no hard and fast rules.


    Hugo Kornelis, SQL Server MVP
    My SQL Server blog: http://sqlblog.com/blogs/hugo_kornelis
    Sunday, January 16, 2011 8:06 PM
  • Thanks for all the great responses. I am going to keep the primary key (product_id) as an integer that auto increments, and add another column (item_id) or something to store the actual product id, and that column will be of varchar data type as you've mentioned Erland.

    If I've missed something please let me know otherwise I will just go with that.....

    -Westside2008

    Sunday, January 16, 2011 8:09 PM
  • 3. Referencing columns are relatively small and joins are relatively
    fast if only 8-byte integers are used.

    Unless Westside to sell the entire universe of products, probably a 4-byte integer would do!

    (But there is all reason when you introduce a surrogate key to consider whether 32 bits will be enough. For a product catalog, no sweat, but tables with transactions, orders etc it may be an issue.)


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
    Links for SQL Server Books Online: SQL 2008, SQL 2005 and  SQL 2000.
    (Just click the link you need.)
    Sunday, January 16, 2011 11:04 PM
  • No, any data type can be a key. But most product ids, SKUs, etc. are industry standards (VIN for automobiles, ISBN for books, ISAN for audio-visual media, etc.) and they are fixed length with some kind of validation, like a regular expression or a check digit. 

    Newbies hate this part of RDBMS; you have to actually design data and not just code away. And it is hard work. Lazy bastards will jut throw a meaningless, machine generated GUID, IDENTITY or whatever on the row to mimic a file-system pointer and use it. They lose validation and verification, cannot port their data and have other problems. 

    Do some research and see what your industry uses. If you need to actually design your own encoding, I have a few chapters on how to do that in SQL PROGRAMING STYLE. 


    --CELKO-- Books in Celko Series for Morgan-Kaufmann Publishing: Analytics and OLAP in SQL / Data and Databases: Concepts in Practice Data / Measurements and Standards in SQL SQL for Smarties / SQL Programming Style / SQL Puzzles and Answers / Thinking in Sets / Trees and Hierarchies in SQL
    Monday, January 17, 2011 9:58 PM
  • Thanks for all the great responses. I am going to keep the primary key (product_id) as an integer that auto increments, and add another column (item_id) or something to store the actual product id, and that column will be of varchar data type as you've mentioned Erland.

    If I've missed something please let me know otherwise I will just go with that.....

    -Westside2008

    Q:Why did we invent databases? A:to remove redundancy  and later to ensure data integrity. 

    Your "magic number" is redundant. You still have to keep the real product_id unique and valid (check constraints for check digits, or regular expressions). This idea started 3 decades ago with Sybase on 16 bit hardware and disk drives with less storage than your MP3 player. Today, we have 64 bit hardware and SSD. Is this product_id going to be more than 8 letters (1 word) long? Maybe two words long? Those comparisons are machine instructions now. 

    I can post a whole rant with examples of how this waste disk, destroys portability, destroys data exchange,  and causes dangerous integrity problems. But Google around for yourself and  don't be lazy. 


    --CELKO-- Books in Celko Series for Morgan-Kaufmann Publishing: Analytics and OLAP in SQL / Data and Databases: Concepts in Practice Data / Measurements and Standards in SQL SQL for Smarties / SQL Programming Style / SQL Puzzles and Answers / Thinking in Sets / Trees and Hierarchies in SQL
    Monday, January 17, 2011 10:12 PM
  • I'm planning on storing product information in a table and I was going to use varchar for my datatype in my product_id column because all of the product ids aren't integers.  Is there any issues with this?

    Thanks


    Use surrogate key ID INT IDENTITY(1,1) for Primary Key. Make the product_id column UNIQUE KEY or place a UNIQUE INDEX on it.

    Example:

    -- Columns from AdventureWorks2008.Production.Product
    CREATE TABLE Product(
    	ProductID int IDENTITY(1,1) PRIMARY KEY,
    	ProductSubcategoryID int NULL REFERENCES ProductSubcategory(ProductSubcategoryID),
    	ProductModelID int NULL REFERENCES ProductModel(ProductModelID),
    	Name varchar(50) NOT NULL UNIQUE,
    	ProductNumber nvarchar(25) UNIQUE,
    	ListPrice money NOT NULL DEFAULT (0.0),
    	Size nvarchar(5) NULL,
    	SizeUnitMeasureCode nchar(3) NULL,
    	WeightUnitMeasureCode nchar(3) NULL,
    	Weight decimal(8, 2) NULL,
    	ProductLine nchar(2) NULL,
    	Class nchar(2) NULL,
    	Style nchar(2) NULL,
    	rowguid uniqueidentifier ROWGUIDCOL NOT NULL UNIQUE,
    	ModifiedDate datetime DEFAULT (getdate()));
    

    Related thread: http://social.msdn.microsoft.com/Forums/en-US/sqlgetstarted/thread/27bd9c77-ec31-44f1-ab7f-bd2cb13129be/


    Kalman Toth SQL SERVER 2012 & BI TRAINING
    New Book: Beginner Database Design & SQL Programming Using Microsoft SQL Server 2012
    Saturday, January 22, 2011 6:59 AM
  • Some people will prefer to add an identity column as surrogate key to
    every table, and then add a unique constraint to the product_id
    column. Other people will prefer to use the "real" key as the priimary
    key.

    Both approaches have advantages and disadvantages.

    Additional consideration.

    Newbies should definitely go with ID INT IDENTITY(1,1) PRIMARY KEY.

    Using Natural Key as SQL Server db table Primary Key belongs to the domain of database experts.


    Kalman Toth SQL SERVER 2012 & BI TRAINING
    New Book: Beginner Database Design & SQL Programming Using Microsoft SQL Server 2012
    Saturday, January 22, 2011 7:16 AM
  • Newbies should definitely go with ID INT IDENTITY(1,1) PRIMARY KEY.

    Using Natural Key as SQL Server db table Primary Key belongs to the domain of database experts.


    No. Au contraire.

    This is a bit like saying that driving a car with manual gears is for experts and new drivers should use automatic gears only.

    It is not until you properly understand natural keys, and you should starting surrogates. If you start off using surrogate key, you may never learn abouit natural keys, and more importantly to do a proper database design. That's when you get a table where the only non-nullable column is an IDENTITY column, and the rest is nullable. That's the people who will produce tables like:

    CREATE TABLE product_categories (
        id          int IDENTITY PRIMARY KEY,
        product_id  int NULL,
        category_id int NULL
    )

    These databases will have all sorts of problems with duplicates, inconsistent data. And it will take a real expert to clean up the mess.

    Which is not needed if you leart do to it right from the beginning.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
    Links for SQL Server Books Online: SQL 2008, SQL 2005 and  SQL 2000.
    (Just click the link you need.)
    Saturday, January 22, 2011 11:22 AM
  • I'm with Erland on this one.  I often see newbies add an identity column to every table (often named ID) and call it the primary key.  This is a just a way to avoid the upfront work needed for a good database design.  There is simply no substitute for the analysis needed to design a database properly with constraints that preserve data integrity.

    I was very fortunate to have taken a database design class long before I started in the field and relational databases became popular.  I was by no means an expert but I knew enough to design (at least) a 3NF database without introducing surrogate keys.  It was only later than I learned how to properly introduce surrogates into the design.

    IMHO, newbies should start with natural keys and leave surrogates to the domain of database experts.

     


    Dan Guzman, SQL Server MVP, http://weblogs.sqlteam.com/dang/
    Saturday, January 22, 2011 12:13 PM
  • This is a bit like saying that driving a car with manual gears is for experts and new drivers should use automatic gears only.

    Precisely, that is how we think in America. In Germany, the only country in the world without speed limit, the thinking is manual shifting first.

    Newbies without database design skills should follow AdventureWorksxxxx sample databases which use surrogate identities for Primary Key and unique indexes on Natural Keys (see Production.Product) mostly or unique keys.

    -- UNIQUE key constraint query
    select * from sys.objects 
    where type_desc like '%unique%' order by name
    

    Kalman Toth SQL SERVER 2012 & BI TRAINING
    New Book: Beginner Database Design & SQL Programming Using Microsoft SQL Server 2012
    Saturday, January 22, 2011 2:30 PM
  • There is simply no substitute for the analysis needed to design a database properly with constraints that preserve data integrity.

    Agreed. But that is not for newbies.

    Dan, have you noticed that Database Design expert and DBA are the last ones to be hired on a project? Usually when the project is in trouble?  Most projects start out with reluctant developers "designing" databases and "administering" SQL Servers.

    Hiring a database design expert for a month or so consulting can be invaluable to the success of the database development project.

    Kalman Toth SQL SERVER 2012 & BI TRAINING
    New Book: Beginner Database Design & SQL Programming Using Microsoft SQL Server 2012

    Saturday, January 22, 2011 2:35 PM
  • Precisely, that is how we think in America. In Germany, the only country in the world without speed limit, the thinking is manual shifting first.

    Yeah, I am aware that in North America automatic gears are popular. The first time I ever encountered one was when I rented a car in Canada.

    Elsewhere, you can expect your rental cars to have manual gears. As the car you will drive when you pass the test to get the driving license.

    Newbies without database design skills should follow AdventureWorksxxxx sample databases which use surrogate*identities* for Primary Key and *unique indexes* on Natural Keys (see Production.Product) mostly or unique keys.

    No. As I understand it, you actually work with teaching SQL. It makes me very worried when you suggest complete misconceptions.

    You cannot use surrogate keys until you know what they are surrogates for. To learn database design, you first need to know what keys are and how they are used to identify data.

    Once you have learned the rules, you can learn how to cheat.

    But if you start by cheating, you will never learn the rules.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
    Links for SQL Server Books Online: SQL 2008, SQL 2005 and  SQL 2000.
    (Just click the link you need.)
    Saturday, January 22, 2011 3:16 PM