locked
ColumnsStore Index - RowGroup Elimination - Inserting in Sorted Order RRS feed

  • Question

  • Hi

    In 2014, In order to be able to do RowGroup elimination, we need "a particular column" to be in sorted order "within a partition".

    How do we ensure this particular column's values are in sorted order after a workload/batch ? Is it through MAXDOP = 1 feature when we create CCI Index?   For example, if my workload contains 3 million rows on the first day (when my table was empty), does MAXDOP = 1 guarantee that after the successful load, we have (nearly) 3 Rowgroups (and thus 3 segments) each with 1 million rows, and MAX value of Segment 1 is less-than-or-equal-to MIN value Segment 2, and so on?

    Extending this to day2, if I load further 3 million, will all the 6 rowgroups (and thus 6 segments) be in sorted order because of MAXDOP = 1 settings at the time of CCI index creation?

    thanks

    Friday, August 16, 2019 7:14 PM

Answers

  • How do we ensure this particular column's values are in sorted order after a workload/batch ? Is it through MAXDOP = 1 feature when we create CCI Index?   For example, if my workload contains 3 million rows on the first day (when my table was empty), does MAXDOP = 1 guarantee that after the successful load, we have (nearly) 3 Rowgroups (and thus 3 segments) each with 1 million rows, and MAX value of Segment 1 is less-than-or-equal-to MIN value Segment 2, and so on?

    Extending this to day2, if I load further 3 million, will all the 6 rowgroups (and thus 6 segments) be in sorted order because of MAXDOP = 1 settings at the time of CCI index creation?

    The MAXDOP setting used for the initial CCI creation is not relevant afterwards.

    More specifics about your workload/batch inserts are needed to suggest the CCI ordering implications for rowgroup elimination in your scenario. Check the INSERT...SELECT and bulk insert execution plans for parallelism, sort operators, and source ordering. In the case of INSERT...SELECT queries with no sort operator, see if the source rows are retrieved in the desired order with an ordered index seek/scan of the source. With bulk inserts, make sure the source data is ordered by the partitioning column plus your elimination column and specify a bulk insert ordering hint on the partitioning column to avoid a sort operator in the plan.


    Dan Guzman, Data Platform MVP, http://www.dbdelta.com

    Saturday, August 17, 2019 1:08 PM
    Answerer
  • As you replied to my other post on SQL version that support Rowgroup/Segment elimination, since I use 2014, I believe, there is no point thinking about ordering my data during workload/batch insertion or rebuilding indexes after loading, as there is no support for Rowgroup/Segment elimination feature before 2016,  am I right to assume so?

    I think you misunderstood my answer to your related question. Rowgroups are eliminated in both SQL 2014 and SQL 2016. It is only that the statistics IO message in SQL 2014 doesn't include the skipped message even though row groups are skipped. So there is benefit to ordering data to promote row group elimination in SQL 2014.

    Sorry if I was not clear.


    Dan Guzman, Data Platform MVP, http://www.dbdelta.com

    Sunday, August 18, 2019 1:27 PM
    Answerer
  • Darn, I just typed a long reply and it got lost... A bit shorter (ask followups if unclear):

    When creating the CCI, the table should have a row-based CL ix on the "segment column". I.e., the actual data is now sorted by your segment column. And use MAXDOP 1, else you would have one "first segment" per CPU. I.e., no syntax element for this, we have to use "tweaks" like these.

    As for next bit, I leave it to somebody else. I know of no other way that converting the table to a row-based CL ix and back to a CCI table (again, using MAXDOP 1), whcih is a bit harsh. But i'm likely missing that bit...


    Tibor Karaszi, SQL Server MVP (Web Blog)

    Monday, August 19, 2019 7:51 PM

All replies

  • be in sorted order because of MAXDOP = 1 

    MAXDOP (= Maximum degree of parallelism) controls the count of parallel thread to execute the query and has nothing to do with sort order.

    How do we ensure this particular column's values are in sorted order

    By applying a ORDER BY clause to the SELECT statement. Data don't have a natural order.


    Olaf Helper

    [ Blog] [ Xing] [ MVP]

    Saturday, August 17, 2019 5:46 AM
    Answerer
  • Hi Olaf, thanks for the reply.

    https://blogs.msdn.microsoft.com/sql_server_team/columnstore-index-performance-rowgroup-elimination/

    https://www.sqlpassion.at/archive/2017/01/30/columnstore-segment-elimination/

    As explained in the above blogs, for SQL engine to be able to make decision on eliminating RowGroups that are not required, we need to ensure the data is stored in sorted order (so, SELECT ….ORDER BY  is not the solution required here)

    So, this raises the following questions

    1) How do we ensure the data is Inserted in SORT order into each of the partitions, for each of the daily batch workloads?

    2) Assuming the data is stored in the sorted order WITHIN EACH PARTITION, how do we hint SQL engine to make use of RowGroup elimination feature?  Is it through MAXDOP=1  settings?   If so, are we not compromising on the multithreaded efficient way of SQL's processing?

    thanks

    Saturday, August 17, 2019 7:52 AM
  • How do we ensure this particular column's values are in sorted order after a workload/batch ? Is it through MAXDOP = 1 feature when we create CCI Index?   For example, if my workload contains 3 million rows on the first day (when my table was empty), does MAXDOP = 1 guarantee that after the successful load, we have (nearly) 3 Rowgroups (and thus 3 segments) each with 1 million rows, and MAX value of Segment 1 is less-than-or-equal-to MIN value Segment 2, and so on?

    Extending this to day2, if I load further 3 million, will all the 6 rowgroups (and thus 6 segments) be in sorted order because of MAXDOP = 1 settings at the time of CCI index creation?

    The MAXDOP setting used for the initial CCI creation is not relevant afterwards.

    More specifics about your workload/batch inserts are needed to suggest the CCI ordering implications for rowgroup elimination in your scenario. Check the INSERT...SELECT and bulk insert execution plans for parallelism, sort operators, and source ordering. In the case of INSERT...SELECT queries with no sort operator, see if the source rows are retrieved in the desired order with an ordered index seek/scan of the source. With bulk inserts, make sure the source data is ordered by the partitioning column plus your elimination column and specify a bulk insert ordering hint on the partitioning column to avoid a sort operator in the plan.


    Dan Guzman, Data Platform MVP, http://www.dbdelta.com

    Saturday, August 17, 2019 1:08 PM
    Answerer
  • Thanks Dan.

    As you replied to my other post on SQL version that support Rowgroup/Segment elimination, since I use 2014, I believe, there is no point thinking about ordering my data during workload/batch insertion or rebuilding indexes after loading, as there is no support for Rowgroup/Segment elimination feature before 2016,  am I right to assume so?

    thanks

    Sunday, August 18, 2019 12:55 PM
  • As you replied to my other post on SQL version that support Rowgroup/Segment elimination, since I use 2014, I believe, there is no point thinking about ordering my data during workload/batch insertion or rebuilding indexes after loading, as there is no support for Rowgroup/Segment elimination feature before 2016,  am I right to assume so?

    I think you misunderstood my answer to your related question. Rowgroups are eliminated in both SQL 2014 and SQL 2016. It is only that the statistics IO message in SQL 2014 doesn't include the skipped message even though row groups are skipped. So there is benefit to ordering data to promote row group elimination in SQL 2014.

    Sorry if I was not clear.


    Dan Guzman, Data Platform MVP, http://www.dbdelta.com

    Sunday, August 18, 2019 1:27 PM
    Answerer
  • Darn, I just typed a long reply and it got lost... A bit shorter (ask followups if unclear):

    When creating the CCI, the table should have a row-based CL ix on the "segment column". I.e., the actual data is now sorted by your segment column. And use MAXDOP 1, else you would have one "first segment" per CPU. I.e., no syntax element for this, we have to use "tweaks" like these.

    As for next bit, I leave it to somebody else. I know of no other way that converting the table to a row-based CL ix and back to a CCI table (again, using MAXDOP 1), whcih is a bit harsh. But i'm likely missing that bit...


    Tibor Karaszi, SQL Server MVP (Web Blog)

    Monday, August 19, 2019 7:51 PM