VARCHAR as Primary Key

  • Question

  • Hello there

    I am designing a database for one of my projects, where the table's fields look as below...

    Address                   VARCHAR(100)   -- (or maybe VARBINARY(512)?)
    SomeXML                   XML
    SomeLargeXMLDataChunk1    XML
    SomeLargeXMLDataChunk2    XML
    SomeLargeXMLDataChunk3    XML
    ADateTimeCol              DATETIME

    The data is expected to grow by an average of 50,000 records a day.
    Any application or other program that needs to retrieve a record from this table can "only" pass "Address" as the search/query parameter.
    So it appears that I have no choice but to go for a primary key on the Address column.
    Given that this is a VARCHAR column (or please suggest a better option) and the data is expected to grow at a rate of 50,000 records per day, can you suggest an optimal way of dealing with this scenario?

    Any help is appreciated

    Please let me know.

    Thanks

    Saturday, December 18, 2010 6:30 AM

All replies

    I'm facing the same challenge.

    I have an extra BIGINT IDENTITY column, which will be used for references from other tables.

    Can anybody suggest how we can deal with Deepak's situation?


    --------------------------------- Devender Bijania
    Saturday, December 18, 2010 11:00 AM
    Address is unlikely to be a reliable choice for a key unless you have some software that will clean the address so that it is always passed in the same format (without variations of spelling, abbreviations, etc.). Some criteria for a good key are familiarity, simplicity and stability.

    Consider using a hash as the key. Hash the address after removing spaces, punctuation and capitalisation. That should give you the best chance to avoid inserting a duplicate address. Then use some address deduplication software to clean the data on a regular basis and eliminate duplicates. If you Google for address deduplication you will find lots of software on offer to help with this.
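    For illustration, a minimal sketch of the normalise-then-hash step described above (assuming SQL Server 2008 or later; the cleanup rules, names, and the choice of SHA1 are mine, not a standard):

        -- Strip spaces and common punctuation, lower-case, then hash.
        -- The REPLACE list is illustrative; real cleanup needs more rules.
        DECLARE @Address VARCHAR(100) = '12 Main Street, Apt. 4';
        DECLARE @Clean   VARCHAR(100) =
            LOWER(REPLACE(REPLACE(REPLACE(@Address, ' ', ''), ',', ''), '.', ''));
        SELECT HASHBYTES('SHA1', @Clean) AS AddressHash;   -- 20-byte VARBINARY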

    Saturday, December 18, 2010 11:48 AM
  • Hi Deepak,

    My first question is about the comment on this line:

    Address                   VARCHAR(100)   -- (or maybe VARBINARY(512)?)

    Why on earth would you ever use varbinary to store an address?

    To answer your question, I need a bit more information. Are there
    other tables referencing this table? If so, I would advise you to add
    a surrogate key (IDENTITY column), use that as the primary key and
    suffice with a unique constraint on the Address column. The IDENTITY
    value can then be used to implement foreign key references; the
    nonclustered index associated with the UNIQUE constraint will be used
    to satisfy queries based on the address.
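    A minimal sketch of that layout (the table and constraint names are made up; the large XML columns are elided):

        CREATE TABLE dbo.AddressData
        (
            AddressID    INT IDENTITY(1,1) NOT NULL,
            Address      VARCHAR(100)      NOT NULL,
            SomeXML      XML               NULL,
            -- ...the three large XML chunk columns omitted for brevity...
            ADateTimeCol DATETIME          NOT NULL,
            -- The surrogate key carries the foreign key references...
            CONSTRAINT PK_AddressData PRIMARY KEY CLUSTERED (AddressID),
            -- ...while this nonclustered index serves the address lookups.
            CONSTRAINT UQ_AddressData_Address UNIQUE NONCLUSTERED (Address)
        );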

    If you have no other references, then there are still several options.
    The next thing to consider is how the expected 50,000 new rows per day
    are spread over the day. If they are spread fairly evenly (approx. 30
    rows per minute), then the page splits you'll get from using Address
    as a clustered index won't hurt you. But if they come in only a few
    big chunks that need to be processed quickly, these page splits will
    cause problems.

    There are two ways to get around these page splits. One is the
    surrogate key IDENTITY column mentioned above. The other alternative
    is to keep the table as a heap - that is, have no clustered index at
    all. The uniqueness of the Address column can be enforced with either
    a PRIMARY KEY or a UNIQUE constraint, but for the PRIMARY KEY you
    have to specify the NONCLUSTERED keyword, otherwise SQL Server will
    default it to a clustered index (a UNIQUE constraint defaults to
    nonclustered).

    Note however that heaps require their own kind of maintenance - if
    you are not aware of the potential pitfalls of heaps, the IDENTITY
    column might be a better choice.
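    As a sketch, the heap variant could look like this (names are made up; the point is the explicit NONCLUSTERED keyword):

        CREATE TABLE dbo.AddressDataHeap
        (
            Address      VARCHAR(100) NOT NULL,
            ADateTimeCol DATETIME     NOT NULL,
            -- Without NONCLUSTERED, the primary key would become the
            -- clustered index and the table would no longer be a heap.
            CONSTRAINT PK_AddressDataHeap PRIMARY KEY NONCLUSTERED (Address)
        );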

    If the table is not referenced from other tables, and the inserts are
    spread fairly evenly over the day, using Address as primary key,
    enforced by a clustered index, will not be a problem at all. I expect
    no difficulty, even if you have to process 100 or so inserts per
    second. Just make sure to schedule regular rebuilds of the index,
    preferably with a fillfactor setting that allows room for inserting
    new rows without page splits.
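    For example (assuming the primary key index is named PK_AddressData as in the earlier sketch; the fill factor of 90 is just an illustration, leaving 10% free space per page):

        ALTER INDEX PK_AddressData ON dbo.AddressData
            REBUILD WITH (FILLFACTOR = 90);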


    Hugo Kornelis, SQL Server MVP
    My SQL Server blog: http://sqlblog.com/blogs/hugo_kornelis
    Saturday, December 18, 2010 12:03 PM
  • Hi Hugo & all,

    Thanks for the responses from you all. This table will be standalone and will not be referenced by any other tables, so having another INT column is not an option.
    I am OK with storing (inserts), but my main area of concern is querying the data based on the address as the search parameter.

    I guess Hugo's last paragraph covers inserts with a clustered index on the VARCHAR address field. So the only question that remains is the data fetch, given that the calling program can provide "only" the address as a search parameter.

    Based on this scenario I was thinking of making the address column VARBINARY(512). The idea was the same as what David suggested: to go for hashing. Duplicate addresses are permitted with different "ADateTimeCol" values.
    So the question is how to handle retrieval from the table, given that the search will be an exact text match and the table will hold a huge amount of data.

    In general, the scenario is a table which is expected to grow and can have tons and tons of records. The question is not only inserts, but how fetches can be made faster when searching on a VARCHAR field.

    Thanks

    Saturday, December 18, 2010 6:29 PM
  • Hi Deepak,

    It seems to me that your problem is not database related, not even
    technology related. It's a business problem.

    As David wrote, addresses are awkward (and that's putting it mildly)
    in searches. There can be misspellings, different abbreviations
    (street / str / st), different order of elements, including or
    excluding postal code, etc.

    The problem of finding matching data based on the address only is not a
    database problem, as you'd have the exact same problem if the data is
    kept in a flat file, in a huge stack of punch cards, or even engraved
    in stone tablets. The business (managers) will have to work out a
    solution to that problem. You can only implement a solution once it's
    decided what the solution should be.


    Hugo Kornelis, SQL Server MVP
    My SQL Server blog: http://sqlblog.com/blogs/hugo_kornelis
    Saturday, December 18, 2010 11:39 PM
  • Hi Hugo,

    I totally agree with you. I presented the same questions in initial discussions with the business managers.
    But that's how the business requirement is. The business does understand the issue of "st." sometimes being written as "street", and many other variations.
    Basically, at insert time the address provided will be well parsed through some application algorithm, so the chance of many variations of the same address getting inserted is rare. Also, when the program makes a query, it will parse the address to be searched through the same algorithm before sending it.
    The business also accepts responsibility that data will be returned only if the searched address exactly matches one of those stored.

    How about applying HASHBYTES and storing the result as VARBINARY?
    Then, while searching, compare the HASHBYTES of the search value with the table field value. Do you think this might improve the speed of the data fetch?

    Thanks

    Sunday, December 19, 2010 12:24 AM
  • Hi Deepak,

    Yes, I think that indexing and searching on a hash can speed up the
    search. But remember that hash functions have the chance of collisions
    (two different string values returning the same hash value), so the
    hashed column should be indexed with a *non*unique index, and the
    search should also compare the full address of the search (after
    normalising the form) with the stored full address.

    If even with the normalizing step, you still need to find "near
    matches", then the hashing won't work - two almost identical strings
    can result in completely different hash values.
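    A sketch of what that could look like (reusing the dbo.AddressData table from earlier, but without its unique constraint, since duplicate addresses are allowed here; the column, index, and algorithm choices are illustrative):

        -- HASHBYTES is deterministic, so the computed column can be
        -- persisted and indexed.
        ALTER TABLE dbo.AddressData
            ADD AddressHash AS
                CAST(HASHBYTES('SHA1', Address) AS VARBINARY(20)) PERSISTED;

        -- Deliberately NOT unique, because of possible hash collisions.
        CREATE NONCLUSTERED INDEX IX_AddressData_AddressHash
            ON dbo.AddressData (AddressHash);

        -- Seek on the narrow hash index, then re-check the full address
        -- to rule out collisions.
        DECLARE @SearchAddress VARCHAR(100) = '12mainstreetapt4';  -- already normalised
        SELECT Address, ADateTimeCol
        FROM   dbo.AddressData
        WHERE  AddressHash = CAST(HASHBYTES('SHA1', @SearchAddress) AS VARBINARY(20))
          AND  Address     = @SearchAddress;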


    Hugo Kornelis, SQL Server MVP
    My SQL Server blog: http://sqlblog.com/blogs/hugo_kornelis
    Sunday, December 19, 2010 11:18 PM