locked
which column PK if I know 0 about the table? RRS feed

  • Question

  • Hi experts,

     

    I was asked to bring a table from  a remote server to our local server and sync it…

     

    I accessed the linked server and the table resulted to be a view… No problem, I used the wizard and brought entirely, and I have a weekly SP that brings the records that have a higher UpdateDate than the highest I have locally…

                                                                  

    My question is… the wizard brings it as a heap… Should I create a PK… or any index? How can I do that if I know nothing about the table… Is it ok to pick the first unique column as a PK?

     

    Ps: users still don’t know how exactly are they going to query it, so I can’t ask them…

    Thursday, August 9, 2018 11:09 AM

All replies

  • and the table resulted to be a view

    Hello,

    If the source is a view then it's difficult, because the result will change over time and what Looks now as unique is may in a few days not longer unique.


    Olaf Helper

    [ Blog] [ Xing] [ MVP]

    Thursday, August 9, 2018 11:23 AM
  • Its difficult if you don't know what data table contains and what combination would give you a realistic PK.

    If you want to make PK, you should have clear understanding on table data and its relevance


    Please Mark This As Answer if it solved your issue
    Please Vote This As Helpful if it helps to solve your issue
    Visakh
    ----------------------------
    My Wiki User Page
    My MSDN Page
    My Personal Blog
    My Facebook Page

    Thursday, August 9, 2018 11:25 AM
  • My question is… the wizard brings it as a heap… Should I create a PK… or any index? How can I do that if I know nothing about the table… Is it ok to pick the first unique column as a PK?

    You might try executing sp_describe_first_result_set (https://docs.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/sp-describe-first-result-set-transact-sql) to list view meta-data. Browse information mode 1 will list each column used in the view. Columns with is_hidden = 0 and is_part_of_unique_key = 1 could be used as a primary key.

    As to whether you should create a primary key, probably not for the purpose of data integrity since this table is apparently a read-only copy and integrity is presumably maintained in the source database. You should still create indexes (unique and/or clustered) to optimize your queries. One of those could also be the PK.

    Usage example:

    EXEC sp_describe_first_result_set 
    	  @tsql = N'SELECT * FROM [YourLinkedServer].YourDatabase.YourSchema.YourView;'
    	, @parms = NULL
    	, @browse_information_mode = 1;


    Dan Guzman, Data Platform MVP, http://www.dbdelta.com


    Thursday, August 9, 2018 12:22 PM
  • Hi Maca,

    "My question is… the wizard brings it as a heap… Should I create a PK… or any index? How can I do that if I know nothing about the table… Is it ok to pick the first unique column as a PK?"

    • May I ask, WHY you want to have a PK or CI on your HEAP?
    • Is the "table" used in a referentiel integritiy model in your database?
    • Do you want to add new data into the new table?
    • Do you want to change data in the new table?
    • Are you planning index maintenance on the new table?

    If you can answer all questions with NO I would leave he table as a heap and create an AUDIT to monitor the access to the table. Study afterwards the access patterns and create appropriate nonclustered indexes on the predicate attributes.

    A CI would cause much more IO for key lookups than a heap (which only need 1 logical IO.

    Because you've written "users still don’t know how exactly are they going to query it, so I can’t ask them…" a heap causes less IO with a FULL SCAN than a PK / CI!

    I am wondering why so many people see it as mandatory to have a CI or PK on a table :)

    All depends on the workloads and the access patterns.

    Mabye this excellent article about some drawbacks of CI may change your mind:

    http://kejser.org/clustered-indexes-vs-heaps/

    Or this one - which is especially a good point if you only read your data!

    https://use-the-index-luke.com/blog/2014-01/unreasonable-defaults-primary-key-clustering-key


    Uwe Ricken (Blog | Twitter)
    Microsoft Certiied Master - SQL Server 2008
    Microsoft Certified Solution Master - CHARTER Data Platform
    Microsoft Certified Solution Expert - Data Platform
    db Berater GmbH
    Microsoft SQL Server Blog (german only)

    Saturday, August 11, 2018 10:31 AM
  • I am wondering why so many people see it as mandatory to have a CI or PK on a table :)

    You're wondering why people think it is mandatory to have a primary key on a table?????????????????

    Well, in a relational database it is more or less mandatory. If you don't have a primary key, how will you find the data?

    A CI would cause much more IO for key lookups than a heap (which only need 1 logical IO.

    But you thought keys were not necessary? Without keys, you will need to scan your heap.

    As for CI vs. heaps, the short story is that in SQL Server clustered indexes are first-class citizens, while heaps are treated as something as the cat brought in. You can get better performance with heaps if you know exactly what you are doing, but most of us don't.

    So, yes, for the people hanging out in these forums, it's best to recommend them to have a clusterd index. As long as they are on SQL Server. On Oracle, it is entirely different.

    Saturday, August 11, 2018 11:21 AM
  • Hi Erland,

    don't get me wrong - A PK is a mandatory object for a fully relational database model. But in my carreer I have seen so often additional tables which do not fit into the relational model itself but are more for "temporary" purposes.

    In all these cases I think a PK is nonsence but - MAYBE - a clustered index may be helpful.

    "As for CI vs. heaps, the short story is that in SQL Server clustered indexes are first-class citizens, while heaps are treated as something as the cat brought in. "

    It always depends on the workload. There are lots of situations where the heap is the first-class citizen :)

    "So, yes, for the people hanging out in these forums, it's best to recommend them to have a clusterd index."

    I think it's better to point people to alternatives rather than just spreading stereotypes. If a CI cannot scale (because of the f... IDENTITY()) than it seems better to me to point to alternatives. :)

    From the request:

    "and I have a weekly SP that brings the records that have a higher UpdateDate than the highest I have locally"

    In this special case and the information that no other indexes will be set I would recommend a clustered index on the UPDATEDATE. That would speed up the update process. There are two reasons for this decision:

    • quick search for affected rows
    • no forwarded records (if varchar is used)
    • no updates of two objects (HEAP + NCI on UPDATEDATE)

    If the "brings" does mean that only NEW data will be inserted into the table I would recommend to leave it as a HEAP.

    • If the database is running im SIMPLE / BULK LOGGED we can use minimal logging with TABLOCK
    • No data will be changed; there is no risk for Forwarded Records
    • Only a NCI on the UPDATEDATE column has to be updated (small)

    All the best from Gemany...


    Uwe Ricken (Blog | Twitter)
    Microsoft Certiied Master - SQL Server 2008
    Microsoft Certified Solution Master - CHARTER Data Platform
    Microsoft Certified Solution Expert - Data Platform
    db Berater GmbH
    Microsoft SQL Server Blog (german only)

    Saturday, August 11, 2018 11:52 AM
  • don't get me wrong

    Of course, I get you wrong - life is more fun that way!

    - A PK is a mandatory object for a fully relational database model. But in my carreer I have seen so often additional tables which do not fit into the relational model itself but are more for "temporary" purposes.

    In all these cases I think a PK is nonsence but - MAYBE - a clustered index may be helpful.

    Calling PK a nonsense - nah, that is not convincing.

    Let's take temp tables. Yes, they are temporary, so why would they need
    PKs and that crap?

    Experience has taught me that it is always a good idea to identify a unique key (sometimes not a PK, because it may involve a nullable column) in your temp tables or table variables. A real key, not an IDENTITY slapped on, but never used. There are two advantanges:

    1) It serves as an assertion about my assumption about the data.
    2) It can help my queries to run faster.

    I will have to admit to have designed a few tables without a PK. One is a transient table which is only used as a computation are and which is quite denormalised.

    And then there are some tables that do not have a PK, as the syntax does not support zero-column keys. That is, they are single-row tables.

    I think it's better to point people to alternatives rather than just spreading stereotypes. If a CI cannot scale (because of the f... IDENTITY()) than it seems better to me to point to alternatives. :)

    CIs with an ascending key scales well for most people, but there are high-end cases where they do not. Here we are talking about a table which is refreshed once a week, so this is not an issue here.

    But whether maca actually should modify the table is of course nothing we can answer, since we don't know the details. Maybe the local copy should be a true copy of the original. Maybe it should not.

    However, it is worth pointing out that the the copy is refreshed weekly and this happens with DELETE + INSERT, this can lead to a table which is quite inflated if you go for a heap. And that is one of the reasons, most of us commoners should stay with CI. Heaps require more care.

    Saturday, August 11, 2018 3:08 PM
  • It always depends on the workload. There are lots of situations where the heap is the first-class citizen :)

    Hi Uwe, I know that you know all this stuff backwards and forwards, but let me toss in a couple of ways of talking about it.

    A heap is never a first-class citizen - but there may be nothing you want to do about it, especially if it's small.

    OP didn't tell us, but I think his data is large - thousands of rows, maybe millions.

    Now the thing is, any old resultset - and if it comes from a view that's all it might be - is not a normalized table.  So you can't treat it like a normalized table.  Only a normalized table is a first-class citizen!

    But as for why a PK or CI, it's because SQL Server "likes" it that way.

    If the data is a million rows and is going to be queried by a lot of people, you probably want several indexes on it, and on SQL Server if you have several indexes you'd like to have an efficient and unique CI, and if you have that it's also a good PK.

    Again, I know you know all this, but OP does not know all this.

    Finally - OP, if you can't find a good unique key in the data then it is not already normalized, but you can always add an identity field during the import and that can be your PK and/or CI, and this is done very frequently in SQL Server for the reasons above.

    Josh

    Saturday, August 11, 2018 4:59 PM