Does creating a clustered index after the table is loaded essentially load it again?

  • Question

  • Hi, we run 2017 Standard. I'm looking through some work others have done here and noticed that they truncate tables, drop indices, load the tables, and recreate the indices in their ETL.

    I'm used to seeing this practice with nonclustered indexes but not clustered ones. Don't clustered indexes essentially determine the order in which the data will attempt to be loaded? Which means creating a clustered index after the data is loaded will effectively reload the entire table? Which is usually not the most efficient thing in the world?


    • Edited by db042190 Tuesday, November 26, 2019 4:40 PM better title
    Tuesday, November 26, 2019 3:55 PM


All replies

  • Which means creating a clustered index after the data is loaded will effectively reload the entire table?  

    The SQL Server engine isn't aware of where the original data came from, so it cannot reload any data. When a clustered index is created, the data pages for the table get rearranged to align with the clustered index.
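
    To illustrate, here is a minimal T-SQL sketch, assuming made-up table and column names (dbo.StageSales, dbo.SourceSales): the heap is loaded first, and creating the clustered index afterwards sorts and rewrites the existing pages rather than re-reading the source.

        -- Hypothetical example: load a heap, then create the clustered index.
        -- Building the index rewrites the table's pages in key order; it does
        -- not go back to the original data source.
        CREATE TABLE dbo.StageSales
        (
            SaleID   INT           NOT NULL,
            SaleDate DATE          NOT NULL,
            Amount   DECIMAL(18,2) NOT NULL
        );

        -- Bulk load into the heap (dbo.SourceSales is a placeholder source).
        INSERT INTO dbo.StageSales (SaleID, SaleDate, Amount)
        SELECT SaleID, SaleDate, Amount
        FROM   dbo.SourceSales;

        -- Creating the clustered index now rearranges the pages to align
        -- with the index key.
        CREATE CLUSTERED INDEX CIX_StageSales_SaleID
            ON dbo.StageSales (SaleID);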

    Olaf Helper

    [ Blog] [ Xing] [ MVP]

    • Marked as answer by db042190 Wednesday, November 27, 2019 6:06 PM
    Tuesday, November 26, 2019 4:46 PM
    Moderator
    Thanks, Olaf. Is that unwise?

    Nine times out of ten, wouldn't loading from the start with the CI present be more efficient?

    These tables are reloaded from scratch once daily.

    In companies that reload, what do you see more often: dropping the CIs, or leaving them in place during the initial load?


    • Edited by db042190 Tuesday, November 26, 2019 5:05 PM trying to steer the question
    Tuesday, November 26, 2019 5:03 PM
  • It depends on what you are doing, as usual.

    If you are doing the data load into an empty table from a single thread, then it's OK to have indexes in place. If you are loading in parallel from multiple threads, then the table should have zero indexes of any kind.

    Also, if you load data into an empty table with a clustered index in place, the B-tree will not be in its most compact form after the load. Is this important? If so, rebuild the index, in which case you may as well have started with zero indexes.

    My preference is to have two filegroups, aside from PRIMARY.

    1. Truncate the table.

    2. Drop the nonclustered indexes.

    3. Rebuild the clustered index, moving it from filegroup A to filegroup B.

    4. Drop the clustered index.

    5. Run the parallel data load.

    6. Create the clustered index back on FG A (moving the data from FG B, sort in tempdb).

    7. Build the nonclustered indexes.

    But it all depends on what you want (a rough sketch of the steps follows below).
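
    As an illustration only, here is a hedged T-SQL outline of those steps, with invented table, index, column, and filegroup names (dbo.FactLoad, LoadKey, FG_A, FG_B). It is a sketch of the sequence, not a drop-in script.

        -- Assumes dbo.FactLoad starts on filegroup FG_A with clustered index
        -- CIX_FactLoad and nonclustered index IX_FactLoad_Customer already in
        -- place, and that filegroup FG_B exists. All names are hypothetical.

        -- 1. Truncate the table.
        TRUNCATE TABLE dbo.FactLoad;

        -- 2. Drop the nonclustered indexes.
        DROP INDEX IX_FactLoad_Customer ON dbo.FactLoad;

        -- 3. Rebuild the clustered index, moving it from FG_A to FG_B.
        CREATE CLUSTERED INDEX CIX_FactLoad
            ON dbo.FactLoad (LoadKey)
            WITH (DROP_EXISTING = ON)
            ON FG_B;

        -- 4. Drop the clustered index (the table becomes a heap on FG_B).
        DROP INDEX CIX_FactLoad ON dbo.FactLoad;

        -- 5. Parallel data load into the heap happens here (e.g. several
        --    concurrent bulk-insert streams).

        -- 6. Create the clustered index back on FG_A, sorting in tempdb.
        CREATE CLUSTERED INDEX CIX_FactLoad
            ON dbo.FactLoad (LoadKey)
            WITH (SORT_IN_TEMPDB = ON)
            ON FG_A;

        -- 7. Build the nonclustered indexes.
        CREATE NONCLUSTERED INDEX IX_FactLoad_Customer
            ON dbo.FactLoad (CustomerID);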


    jchang

    • Proposed as answer by André Renato Furtado Tuesday, November 26, 2019 8:21 PM
    • Unproposed as answer by db042190 Wednesday, November 27, 2019 11:34 AM
    • Marked as answer by db042190 Wednesday, November 27, 2019 6:06 PM
    Tuesday, November 26, 2019 7:04 PM
  • Thanks, jchang61. Apparently clustered indexes are more independent of the data pages than I thought.

    In the old days, when I heard about hot spots, I just kind of assumed that SQL would do its best, regardless of the fragmentation incurred, to ensure the data was physically ordered to coincide with the clustered index. And therefore it wasn't crazy to bring the data in sorted on the CI's columns when one-time loads were involved.

    This particular app has a data flow task which I have to believe is single-threaded on its way to a SQL destination. There is one source query. I've never heard of SQL parallelizing inserts, but who knows.

    I think the most important factor in our shop currently is the speed of loading the data. Rebuilding indexes would have to count against (penalize) that speed. Nine out of ten of our tables are used once in a 24-hour period to load specialized marts.

    Tuesday, November 26, 2019 7:37 PM
  • Yes, it is certainly my impression that creating a new CI means rewriting the entire table.

    However, that does not really answer the question of whether it is more efficient to create the CI on the empty table and then load the data.  Only if the data arrives with CI keys in order (as with identity fields) would it be comparable.  If it arrives in random order you will be getting page splits all day and it might end up slower than doing it afterwards, whatever the cost.  TANSTAAFL.

    As usual, depending on the details YMMV and sometimes you just have to try and see.

    I've certainly done it both ways, uncertain as to which is better, and maybe the difference is often small anyway.
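
    To make the page-split point concrete, here is a hedged sketch with made-up names (dbo.Orders, dbo.OrderSource): option A loads into an existing clustered index in key order, option B loads a heap and builds the index afterwards, and sys.dm_db_index_physical_stats shows how fragmented the result is either way.

        -- Option A: create the clustered index first, then load in key order.
        -- This works well only if the source really arrives sorted on the key;
        -- random key order causes page splits during the load.
        CREATE TABLE dbo.Orders (OrderID INT NOT NULL, OrderDate DATE NOT NULL);
        CREATE CLUSTERED INDEX CIX_Orders ON dbo.Orders (OrderID);

        INSERT INTO dbo.Orders (OrderID, OrderDate)
        SELECT OrderID, OrderDate
        FROM   dbo.OrderSource        -- placeholder source
        ORDER  BY OrderID;            -- present the rows in key order (a hint, not a guarantee)

        -- Option B (alternative): load dbo.Orders as a heap in whatever order
        -- the source delivers, then build the clustered index afterwards:
        --   CREATE CLUSTERED INDEX CIX_Orders ON dbo.Orders (OrderID);

        -- Either way, fragmentation can be inspected after the load.
        SELECT index_id, avg_fragmentation_in_percent, page_count
        FROM   sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Orders'),
                                              NULL, NULL, 'LIMITED');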

    Josh


    • Edited by JRStern Wednesday, November 27, 2019 6:29 AM
    Wednesday, November 27, 2019 6:28 AM
  • Thanks, JR. I could be wrong, but if the main reason for a CI is to align the order of data insertion as closely as possible with the order things will be coming in, to avoid the overhead of fragmentation, then I question why jchang61 is creating one after the fact. What could it possibly be accomplishing, unless he is setting things up for insertions that take place over the rest of the day, from more of a single source and in a more predictable order? Or perhaps the proximity of records extracted via that index offers some economical advantage over what might otherwise be thrashing in a very random insertion scheme left unindexed?

    • Edited by db042190 Wednesday, November 27, 2019 11:56 AM maybe its for read purposes
    Wednesday, November 27, 2019 11:40 AM
  • What could it possibly be accomplishing, unless he is setting things up for insertions that take place over the rest of the day, from more of a single source and in a more predictable order? Or perhaps the proximity of records extracted via that index offers some economical advantage over what might otherwise be thrashing in a very random insertion scheme left unindexed?

    In general the CI is about retrieval, not insertion.  That's a grand generalization of course, and there are systems that work other ways.  But generally databases do 99% reads and 1% (or less) insertions, with a few updates and deletes as needed.  The online stuff tends to be 99% reads.

    But batch scenarios where you reimport a big table every night, those are the problematic ones.

    Batch scenarios where you add a million rows per night to a 100 GB+ table are tough too, but usually best addressed with either partitioned tables or a current-and-history system of some kind.
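
    For the partitioned-table route, a hedged sketch (all names and the partition number are invented): the night's rows are bulk loaded into a staging table with an identical structure and then switched into the big table, which is a metadata-only operation rather than a physical copy.

        -- Assumes dbo.BigFact is partitioned on LoadDate, and dbo.BigFact_Stage
        -- has an identical structure and indexes, sits on the same filegroup as
        -- the target partition, and carries a CHECK constraint limiting LoadDate
        -- to that partition's range. All names here are hypothetical.

        -- 1. Bulk load tonight's rows into the empty staging table.
        INSERT INTO dbo.BigFact_Stage (LoadDate, CustomerID, Amount)
        SELECT LoadDate, CustomerID, Amount
        FROM   dbo.TonightsExtract;    -- placeholder source

        -- 2. Switch the staging table into the matching empty partition.
        ALTER TABLE dbo.BigFact_Stage
            SWITCH TO dbo.BigFact PARTITION 365;   -- partition number illustrative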

    Josh


    • Edited by JRStern Wednesday, November 27, 2019 6:03 PM
    • Marked as answer by db042190 Wednesday, November 27, 2019 6:06 PM
    Wednesday, November 27, 2019 6:02 PM