Question about table partitioning

  • Question

  • Hello, all.  I'm using SQL 2012 Enterprise.  I have 8 very large tables (the largest two having 227M and 118M records, and the others between 11M-44M records).  For performance reasons, I'm considering partitioning the tables across multiple files/filegroups.  For my largest table (227M records), the data is spread across years 2011, 2012, and 2013, with 2013 having 104M records.  So naturally I'm considering partitioning on a Date column.  My question is: should I go with four partitions (2011, 2012, 2013, and 2014 for new data) and still end up with a very large aggregation of data in the 2013 partition (104M), or should I further break down the 2013 partition into months, giving 12 partitions for 2013 alone, with all of 2011 and all of 2012 each on their own partition?  Again, this is for one table; I'd still like to partition the other 7 large tables.  In the end, I could end up with many, many partitions and hence many, many filegroups.  I'm interested in how others partition MULTIPLE large tables.  Can you share partition functions/schemes across tables?

    Any thoughts, your own personal experiences, etc. would be greatly appreciated.  Also, can someone recommend a good book, article, or blog on partitioning large databases?

    Thanks much in advance.

    Roz

    Tuesday, December 31, 2013 9:43 PM

Answers

  • For performance reasons, I'm considering partitioning the tables across multiple files/filegroups.  For my largest table (227M records), the data is spread across years 2011, 2012, and 2013 with 2013 having 104M records.  So naturally I'm considering partitioning on a Date column. 

    Yes, a partition function can be shared by multiple partition schemes and a scheme can be shared by multiple tables.  You need to be aware of the implications, though.  A change to the function will impact all of the dependent schemes and the tables/indexes that use those schemes. 
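    As a sketch of that sharing (all object, filegroup, and column names here are hypothetical, for illustration only): one function can back several schemes, and one scheme can back several tables.

    ```sql
    -- Hypothetical names. One function, two schemes, two tables on one scheme.
    CREATE PARTITION FUNCTION PF_YearDate (date)
    AS RANGE RIGHT FOR VALUES ('2012-01-01', '2013-01-01', '2014-01-01');

    -- Two schemes sharing the same function
    CREATE PARTITION SCHEME PS_Sales AS PARTITION PF_YearDate
    TO (FG2011, FG2012, FG2013, FG2014);
    CREATE PARTITION SCHEME PS_AllOnPrimary AS PARTITION PF_YearDate
    ALL TO ([PRIMARY]);

    -- Two tables sharing the same scheme; a SPLIT/MERGE of PF_YearDate
    -- affects every scheme and every table/index built on it.
    CREATE TABLE dbo.FactSales  (SaleDate  date NOT NULL) ON PS_Sales (SaleDate);
    CREATE TABLE dbo.FactOrders (OrderDate date NOT NULL) ON PS_Sales (OrderDate);
    ```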

    You mention your goal is performance improvement.  Why do you expect performance improvement with partitioning compared to non-partitioned tables?  Is it that you plan to leverage partition switching for incremental load performance?  What sort of workload does this database support?

    You don't need a separate filegroup for each partition.  Exactly how many you should have depends on many factors.  From a performance perspective, isolating individual partitions on separate filegroups will be beneficial only if the workload scans data on different filegroups (on spinning media) concurrently.  This strategy assumes you have separate physical storage dedicated to each filegroup.  Note that partitioning on different filegroups will help here only for large concurrent scans of different years of data within the same table.

    For manageability purposes, you could have one filegroup per year yet partition by month.  This facilitates granular monthly partition-level operations without creating many filegroups needlessly.  Partitioning on different filegroups by year can also reduce the RTO of more frequently used recent data; you could restore the primary and most recent filegroups first, followed by older, less often used data.  This strategy will work only for queries that touch partitions on the restored filegroups.
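    A hedged sketch of that layout (boundary dates and filegroup names assumed): monthly RANGE RIGHT boundaries, with every 2013 month mapped to the same yearly filegroup.

    ```sql
    -- Hypothetical names. Monthly boundaries; all of 2013's partitions land on FG2013.
    CREATE PARTITION FUNCTION PF_MonthlyDate (date)
    AS RANGE RIGHT FOR VALUES (
        '2012-01-01',                                        -- 2012 as one partition
        '2013-01-01','2013-02-01','2013-03-01','2013-04-01',
        '2013-05-01','2013-06-01','2013-07-01','2013-08-01',
        '2013-09-01','2013-10-01','2013-11-01','2013-12-01',
        '2014-01-01');                                       -- new data

    -- 14 boundaries => 15 partitions => 15 filegroup entries
    CREATE PARTITION SCHEME PS_MonthlyDate
    AS PARTITION PF_MonthlyDate TO (
        FG2011,                                              -- everything before 2012
        FG2012,
        FG2013, FG2013, FG2013, FG2013, FG2013, FG2013,
        FG2013, FG2013, FG2013, FG2013, FG2013, FG2013,
        FG2014);
    ```

    One filegroup per year keeps the restore story simple, while the monthly boundaries still allow month-granular SWITCH/MERGE operations.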

    You are wise to plan carefully before implementing partitioning with large tables.  Large tables are unforgiving if you make a mistake.


    Dan Guzman, SQL Server MVP, http://www.dbdelta.com

    • Proposed as answer by Sofiya Li Thursday, January 2, 2014 3:26 AM
    • Marked as answer by Sofiya Li Wednesday, January 8, 2014 9:16 AM
    Wednesday, January 1, 2014 2:26 PM
    Answerer
  • My thinking about performance improvement is that since I've broken the data into smaller datasets (i.e., partitions), my queries will access less data and hence incur less I/O.  Knowing how the data is queried drives my desire to partition in the first place.  This database is a data warehouse, obviously for reporting.

    For report queries that must scan an entire year of data, I would expect similar performance with or without partitioning as long as you have a clustered (or covering) index on date.  But I'd still be inclined to partition for the manageability reasons I mentioned earlier.
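    For reference, a minimal sketch of such a clustered index aligned with the table's partition scheme (assuming a scheme PS_SalesDate on the date column already exists; the table and index names are made up):

    ```sql
    -- Hypothetical names; the index is partition-aligned with the table's scheme,
    -- so year/month predicates on SaleDate allow partition elimination.
    CREATE CLUSTERED INDEX CIX_FactSales_SaleDate
    ON dbo.FactSales (SaleDate)
    ON PS_SalesDate (SaleDate);
    ```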

    A columnstore index is a different beast.  In that case, partitioning can limit the amount of data scanned compared to a non-partitioned version.  So partitioning the columnstore index can improve performance considerably.


    Dan Guzman, SQL Server MVP, http://www.dbdelta.com


    • Edited by Dan Guzman MVP, Editor Wednesday, January 1, 2014 4:25 PM grammar
    • Proposed as answer by Sofiya Li Thursday, January 2, 2014 3:26 AM
    • Marked as answer by Sofiya Li Wednesday, January 8, 2014 9:16 AM
    Wednesday, January 1, 2014 4:23 PM
    Answerer

All replies

  • Use a single partition scheme and implement switch-based logic.  That is like a soft delete.

    Use these catalog views to achieve it: sys.partition_functions, sys.partition_schemes, sys.partitions, sys.allocation_units, sys.filegroups
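    A hedged sketch of that switch-out pattern (all object names hypothetical): the partition is detached as a metadata-only operation, so the "delete" is nearly instant regardless of row count.

    ```sql
    -- Staging table must match the source table's structure and sit on the
    -- same filegroup as the partition being switched out.
    CREATE TABLE dbo.FactSales_Old (SaleDate date NOT NULL) ON FG2011;

    -- Detach partition 1 (the oldest data) into the staging table: metadata-only.
    ALTER TABLE dbo.FactSales SWITCH PARTITION 1 TO dbo.FactSales_Old;

    -- Archive or drop dbo.FactSales_Old at leisure, then merge the now-empty range.
    ALTER PARTITION FUNCTION PF_SalesDate() MERGE RANGE ('2012-01-01');
    ```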

     


     

    Thiru


    Tuesday, December 31, 2013 10:34 PM
  • If you query against more than one partition, I have doubts you will gain performance...

    -- Create partition functions
    CREATE PARTITION FUNCTION PF1(INT) AS RANGE RIGHT FOR VALUES (1, 2, 3);
    CREATE PARTITION FUNCTION PF2(INT) AS RANGE RIGHT FOR VALUES (1, 2);

    -- Create filegroups
    ALTER DATABASE testdb ADD FILEGROUP FG7;
    ALTER DATABASE testdb ADD FILEGROUP FG6;
    ALTER DATABASE testdb ADD FILEGROUP FG5;
    ALTER DATABASE testdb ADD FILEGROUP FG4;
    ALTER DATABASE testdb ADD FILEGROUP FG3;
    ALTER DATABASE testdb ADD FILEGROUP FG2;
    ALTER DATABASE testdb ADD FILEGROUP FG1;

    -- Create partition schemes
    CREATE PARTITION SCHEME PS1 AS PARTITION PF1
    TO (FG1, FG2, FG3, FG4);
    CREATE PARTITION SCHEME PS2 AS PARTITION PF2
    TO (FG5, FG6, FG7);

    CREATE VIEW [dbo].[partition_info]
    AS
    SELECT
         DB_NAME() AS DatabaseName
        ,OBJECT_NAME(p.object_id) AS TableName
        ,p.index_id AS IndexId
        ,CASE WHEN p.index_id = 0 THEN 'HEAP' ELSE i.name END AS IndexName
        ,p.partition_number AS PartitionNumber
        ,prv_left.value AS LowerBoundary
        ,prv_right.value AS UpperBoundary
        ,ps.name AS PartitionScheme
        ,pf.name AS PartitionFunction
        ,CASE WHEN fg.name IS NULL THEN ds.name ELSE fg.name END AS FileGroupName
        -- pages are 8 KB, so * 0.0078125 (= 8/1024) converts page counts to MB
        ,CAST(p.used_page_count * 0.0078125 AS NUMERIC(18,2)) AS UsedPages_MB
        ,CAST(p.in_row_data_page_count * 0.0078125 AS NUMERIC(18,2)) AS DataPages_MB
        ,CAST(p.reserved_page_count * 0.0078125 AS NUMERIC(18,2)) AS ReservedPages_MB
        ,CASE WHEN p.index_id IN (0,1) THEN p.row_count ELSE 0 END AS [RowCount]
        ,CASE WHEN p.index_id IN (0,1) THEN 'data' ELSE 'index' END AS [Type]
    FROM sys.dm_db_partition_stats p
    INNER JOIN sys.indexes i
        ON i.object_id = p.object_id
        AND i.index_id = p.index_id
    INNER JOIN sys.data_spaces ds
        ON ds.data_space_id = i.data_space_id
    LEFT OUTER JOIN sys.partition_schemes ps
        ON ps.data_space_id = i.data_space_id
    LEFT OUTER JOIN sys.partition_functions pf
        ON pf.function_id = ps.function_id
    LEFT OUTER JOIN sys.destination_data_spaces dds
        ON dds.partition_scheme_id = ps.data_space_id
        AND dds.destination_id = p.partition_number
    LEFT OUTER JOIN sys.filegroups fg
        ON fg.data_space_id = dds.data_space_id
    LEFT OUTER JOIN sys.partition_range_values prv_right
        ON prv_right.function_id = ps.function_id
        AND prv_right.boundary_id = p.partition_number
    LEFT OUTER JOIN sys.partition_range_values prv_left
        ON prv_left.function_id = ps.function_id
        AND prv_left.boundary_id = p.partition_number - 1
    WHERE OBJECTPROPERTY(p.object_id, 'IsMSShipped') = 0
        AND p.index_id IN (0,1);
    GO


    Best Regards, Uri Dimant SQL Server MVP, http://sqlblog.com/blogs/uri_dimant/



    • Edited by Uri Dimant MVP Wednesday, January 1, 2014 6:02 AM
    • Proposed as answer by Sofiya Li Wednesday, January 1, 2014 7:56 AM
    • Unproposed as answer by Sofiya Li Thursday, January 2, 2014 3:26 AM
    Wednesday, January 1, 2014 6:01 AM
  • Dan,

    Thanks much for the reply and great insight.  My thinking about performance improvement is that since I've broken the data into smaller datasets (i.e., partitions), my queries will access less data and hence incur less I/O.  Knowing how the data is queried drives my desire to partition in the first place.  This database is a data warehouse, obviously for reporting.

    Thanks for the clarification that a separate filegroup isn't needed for each partition.  I just learned something here!  :)

    I guess in my case, I'll create a filegroup per year and then partition by month as you suggest.  90% of my report queries access the data either for a whole year or a month/year combo.

    As for storage, I don't have separate physical drives for the filegroups.  All filegroups sit on the same physical drive (i.e., SAN).  The only difference is that the data and logs are split, obviously.

    On another note, I'm also considering columnstore indexes to speed up the reports.

    Thanks again for the feedback.


    • Edited by Roshall Wednesday, January 1, 2014 2:57 PM
    Wednesday, January 1, 2014 2:55 PM
  • Dan,

    Do you typically create your columnstore index on ALL of the columns of a table?  Everything I'm reading indicates doing so.  Seems pretty simple to do.

    Thoughts?

    Wednesday, January 1, 2014 5:16 PM
  • Yes, one typically creates the columnstore index on all supported columns.  But if some columns are never used in the scan queries, you can omit those to save space and avoid the associated overhead during index creation.

    You might peruse the columnstore FAQ, but be aware of the significant differences in SQL 2012 versus SQL 2014.  Columnstore indexes in SQL 2012 are non-clustered and read-only.

    http://social.technet.microsoft.com/wiki/contents/articles/3540.sql-server-columnstore-index-faq.aspx
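    A minimal sketch of the SQL 2012 pattern (table and column names are made up for illustration): the columnstore index is nonclustered, and because it makes the table read-only, loads require a disable/rebuild cycle.

    ```sql
    -- Hypothetical names. SQL 2012: nonclustered columnstore, table becomes read-only.
    CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_FactSales
    ON dbo.FactSales (SaleDate, StoreId, ProductId, Quantity, Amount);

    -- To load new data, disable the columnstore index first...
    ALTER INDEX NCCI_FactSales ON dbo.FactSales DISABLE;
    -- ...load the data, then rebuild to make it usable again.
    ALTER INDEX NCCI_FactSales ON dbo.FactSales REBUILD;
    ```

    Partition switching pairs well with this: switching in a loaded, indexed staging partition avoids rebuilding the columnstore index on the whole table.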


    Dan Guzman, SQL Server MVP, http://www.dbdelta.com

    • Proposed as answer by Sofiya Li Thursday, January 2, 2014 3:26 AM
    Wednesday, January 1, 2014 6:06 PM
    Answerer