SQL 2008 R2: MDW Database grows large, Purge doesn't work

  • Question

  • Hi All,

    I have a SQL 2008 R2 Enterprise instance (named SQLUTILITY) which is monitoring six other SQL 2008 R2 instances.  This instance has a single database: Sysutility_MDW.  The database was created when I ran the wizard to make it a Utility Control Point.  I then ran the UCP wizard on each monitored instance to point to SQLUTILITY.

    I then also enabled Data Collection on each of the monitored instances to gather additional statistics (Server, Query, Disk).  The collectors and UCP have been running on each instance just fine since I set them up on January 18th, 2011.

    However, today is February 1st, 2011, and the Sysutility_MDW database is now 300GB in size.  In SSMS I can see that the data file is nearly full, too; there is only 1% of slack space.  I find this a bit shocking.  The monitored instances are not particularly busy, and all the collection schedules are at their default settings.  I had expected a monitoring database to be a few gigabytes in size.

    I see that there is a daily job to purge the database, but it runs and runs and the database never shrinks. 
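    A quick way to see which tables are actually consuming that 300GB is to query `sys.dm_db_partition_stats` (a generic sketch, not MDW-specific; requires VIEW DATABASE STATE permission):

    ```sql
    USE [sysutility_mdw];
    GO
    -- Top tables by reserved space, to see where the growth is going
    SELECT TOP (10)
        s.name + N'.' + t.name AS table_name,
        SUM(ps.reserved_page_count) * 8 / 1024 AS reserved_mb,
        -- count rows only once (heap or clustered index), not per nonclustered index
        SUM(CASE WHEN ps.index_id IN (0, 1) THEN ps.row_count ELSE 0 END) AS row_cnt
    FROM sys.dm_db_partition_stats AS ps
    JOIN sys.tables  AS t ON t.object_id  = ps.object_id
    JOIN sys.schemas AS s ON s.schema_id = t.schema_id
    GROUP BY s.name, t.name
    ORDER BY reserved_mb DESC;
    ```

    If later replies in this thread are any guide, snapshots.notable_query_plan will likely top the list.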

    This is not my first implementation of MDW.  We had previously set it up on our Development cluster as a clustered instance.  The Purge job ran continuously and actually hammered the network because the database was on an iSCSI drive, and it saturated the switch with iSCSI disk traffic. 

    We shut that down and created a new UCP/MDW solution on a dedicated instance, using local storage.  Now it’s not interfering with anything else, and the reports it generates are great, but the database has not stopped growing.   And the Purge job is constantly running.

    Things I’ve tried:

    ·         I have Googled the issue and saw that there was a database bloat issue in SQL 2008 (pre-R2), caused by orphaned records, that was supposedly corrected in SQL 2008 R2, which is what I am running (http://support.microsoft.com/kb/971823)

    ·         I found one person who said he improved performance by adding a nonclustered index, which I went ahead and created today (http://blogs.msdn.com/b/petersad/archive/2009/04/23/sql-server-data-collector-nightly-purge-can-leave-orphaned-rows.aspx).  No idea how effective it will be.

    ·         I found a blog post (http://www.toddbaker.org/blog/2010/12/17/sql-2008-mdw-fixing-long-running-purges/) by Todd Baker who also had my problem.  He solved it by modifying the stored procedure that handles purges ([Core].[SP_Purge_Data]) to use a temp table, allowing it to process records much more quickly.  I copied my SP_Purge_Data  to SP_Purge_Data2, cancelled my Purge_MDW SQL Agent job, and executed the sproc directly.  It is still running.

    I have several questions:

    1.       Is it supported, or advisable, to use a single MDW database for both UCP and other data collectors? 

    a.       Could this be introducing orphaned records?

    2.       Is there a recommended maximum number of instances to monitor per MDW solution?

    a.       It occurs to me that the data collectors may be uploading data faster than the purge can keep up with. 

    3.       Has anyone tried the solution suggested by Todd Baker?  Does it work?  Is it supportable?

    4.       Is there something else I am missing?

     

    Wednesday, February 2, 2011 12:15 AM

Answers

All replies

  • Microsoft hasn't responded to this post yet, so I don't know what the "official" fix is supposed to be.

    However, I think I have a workaround in place. The modified stored procedure suggested in Todd Baker's blog post has worked for me. I manually ran it last night. It ran for 12 hours with the following effect:

    1. The primary data file is still sized at 212GB, but it now has 34% free (vs 1% free). I decided to leave it at its current size.
    2. The log file, which was 90GB and wouldn't let me shrink it (possibly due to pending transactions) went to 98% free and let me shrink it to its original size of 10MB.
    3. Apparently none of the statistical history is gone; I can still browse back to January 18th, when I first set it up.
    4. Reports are rendering much, much, much faster. For some reports, it used to take 30-60 seconds (or more) for the page to load. Now it's nearly instantaneous.
    5. UCP also seems to be working fine: it contains current data and it is performing faster, as well.
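    For reference, a log shrink like the one in step 2 is typically done with DBCC SHRINKFILE; the logical file name below is an assumption, so check sys.database_files for yours:

    ```sql
    USE [sysutility_mdw];
    GO
    -- Find the logical name and current size of each file first
    SELECT name, type_desc, size * 8 / 1024 AS size_mb
    FROM sys.database_files;
    GO
    -- Then shrink the log back down (target size in MB; file name is an assumption)
    DBCC SHRINKFILE (N'sysutility_mdw_log', 10);
    ```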

    So I think I am ready to recommend "The Baker Plan":
    http://www.toddbaker.org/blog/2010/12/17/sql-2008-mdw-fixing-long-running-purges/

    I was a little bit uncertain as to where to paste in his modified code, so I took a guess. It works for me. Note that anything you change in the business logic of the MDW database could (and probably will) be overwritten, or rendered inoperable, when Microsoft releases any future service packs. So, caveat emptor.

    Below is my modified "sp_purge_data2" sproc:

    USE [sysutility_mdw]
    GO

    /****** Object: StoredProcedure [core].[sp_purge_data2] Script Date: 02/02/2011 08:29:15 ******/
    SET ANSI_NULLS ON
    GO

    SET QUOTED_IDENTIFIER ON
    GO


    CREATE PROCEDURE [core].[sp_purge_data2]
    @retention_days smallint = NULL,
    @instance_name sysname = NULL,
    @collection_set_uid uniqueidentifier = NULL,
    @duration smallint = NULL
    AS
    BEGIN
    -- Security check (role membership)
    IF (NOT (ISNULL(IS_MEMBER(N'mdw_admin'), 0) = 1) AND NOT (ISNULL(IS_SRVROLEMEMBER(N'sysadmin'), 0) = 1))
    BEGIN
    RAISERROR(14677, 16, -1, 'mdw_admin')
    RETURN(1) -- Failure
    END

    -- Validate parameters
    IF ((@retention_days IS NOT NULL) AND (@retention_days < 0))
    BEGIN
    RAISERROR(14200, -1, -1, '@retention_days')
    RETURN(1) -- Failure
    END

    IF ((@duration IS NOT NULL) AND (@duration < 0))
    BEGIN
    RAISERROR(14200, -1, -1, '@duration')
    RETURN(1) -- Failure
    END

    -- This table will contain a record if somebody requests that the purge stop.
    -- If a user asks us to purge data, we reset its contents and proceed with the purge.
    -- If somebody in a different session wants purge operations to stop, he adds a record
    -- that we will discover while the purge is in progress.
    --
    -- We don't clear this flag when we exit, since multiple purge operations with different
    -- filters may proceed, and we want all of them to stop.
    DELETE FROM [core].[purge_info_internal]

    SET @instance_name = NULLIF(LTRIM(RTRIM(@instance_name)), N'')

    -- Calculate the time when the operation should stop (NULL otherwise)
    DECLARE @end_time datetime
    IF (@duration IS NOT NULL)
    BEGIN
    SET @end_time = DATEADD(minute, @duration, GETUTCDATE())
    END

    -- Declare a table that will be used to find the valid
    -- candidate snapshots that could be selected for purge
    DECLARE @purge_candidates table
    (
    snapshot_id int NOT NULL,
    snapshot_time datetime NOT NULL,
    instance_name sysname NOT NULL,
    collection_set_uid uniqueidentifier NOT NULL
    )

    -- Find candidates that match the retention_days criteria (if specified)
    IF (@retention_days IS NULL)
    BEGIN
    -- User did not specify a value for @retention_days, therefore we
    -- will use the default expiration day as marked in the source info
    INSERT INTO @purge_candidates
    SELECT s.snapshot_id, s.snapshot_time, s.instance_name, s.collection_set_uid
    FROM core.snapshots s
    WHERE (GETUTCDATE() >= s.valid_through)

    END
    ELSE
    BEGIN
    -- User specified a value for @retention_days; we will use this overridden value
    -- when deciding what counts as old enough to qualify for purge. This overrides
    -- the days_until_expiration value specified in the source_info_internal table
    INSERT INTO @purge_candidates
    SELECT s.snapshot_id, s.snapshot_time, s.instance_name, s.collection_set_uid
    FROM core.snapshots s
    WHERE GETUTCDATE() >= DATEADD(DAY, @retention_days, s.snapshot_time)
    END

    -- Determine which is the oldest snapshot, from the list of candidates
    DECLARE oldest_snapshot_cursor CURSOR FORWARD_ONLY READ_ONLY FOR
    SELECT p.snapshot_id, p.instance_name, p.collection_set_uid
    FROM @purge_candidates p
    WHERE
    ((@instance_name IS NULL) or (p.instance_name = @instance_name)) AND
    ((@collection_set_uid IS NULL) or (p.collection_set_uid = @collection_set_uid))
    ORDER BY p.snapshot_time ASC

    OPEN oldest_snapshot_cursor

    DECLARE @stop_purge int
    DECLARE @oldest_snapshot_id int
    DECLARE @oldest_instance_name sysname
    DECLARE @oldest_collection_set_uid uniqueidentifier

    FETCH NEXT FROM oldest_snapshot_cursor
    INTO @oldest_snapshot_id, @oldest_instance_name, @oldest_collection_set_uid

    -- As long as there are snapshots that matched the time criteria
    WHILE @@FETCH_STATUS = 0
    BEGIN

    -- Filter out records that do not match the other filter criteria
    IF ((@instance_name IS NULL) or (@oldest_instance_name = @instance_name))
    BEGIN

    -- There was no filter specified for instance_name or the instance matches the filter
    IF ((@collection_set_uid IS NULL) or (@oldest_collection_set_uid = @collection_set_uid))
    BEGIN

    -- There was no filter specified for the collection_set_uid or the collection_set_uid matches the filter
    BEGIN TRANSACTION tran_sp_purge_data

    -- Purge data associated with this snapshot. Note: deleting this snapshot
    -- triggers cascade delete in all warehouse tables based on the foreign key
    -- relationship to snapshots table

    -- Cascade cleanup of all data referencing the oldest snapshot
    DELETE core.snapshots_internal
    FROM core.snapshots_internal s
    WHERE s.snapshot_id = @oldest_snapshot_id

    COMMIT TRANSACTION tran_sp_purge_data

    PRINT 'Snapshot #' + CONVERT(NVARCHAR(MAX),@oldest_snapshot_id) + ' purged.';
    END

    END

    -- Check if the execution of the stored proc exceeded the @duration specified
    IF (@duration IS NOT NULL)
    BEGIN
    IF (GETUTCDATE()>=@end_time)
    BEGIN
    PRINT 'Stopping purge. More than ' + CONVERT(NVARCHAR(MAX),@duration) + ' minutes passed since the start of operation.';
    BREAK
    END
    END

    -- Check if somebody wanted to stop the purge operation
    SELECT @stop_purge = COUNT(stop_purge) FROM [core].[purge_info_internal]
    IF (@stop_purge > 0)
    BEGIN
    PRINT 'Stopping purge. Detected a user request to stop purge.';
    BREAK
    END

    -- Move to next oldest snapshot
    FETCH NEXT FROM oldest_snapshot_cursor
    INTO @oldest_snapshot_id, @oldest_instance_name, @oldest_collection_set_uid

    END

    CLOSE oldest_snapshot_cursor
    DEALLOCATE oldest_snapshot_cursor

    PRINT 'Deleting orphaned rows from snapshots.notable_query_plan...'

    -- Delete orphaned rows from snapshots.notable_query_plan. Query plans are not deleted by the generic purge
    -- process that deletes other data (above) because query plan rows are not tied to a particular snapshot ID.
    -- Purging query plans table and the smaller query text table as a special case, by looking for plans that
    -- are no longer referenced by any of the rows in the snapshots.query_stats table. We need to delete these
    -- rows in small chunks, since deleting many GB in a single delete statement would cause lock escalation and
    -- an explosion in the size of the transaction log (individual query plans can be 10-50MB).
    DECLARE @delete_batch_size bigint;
    DECLARE @rows_affected int;
    SET @delete_batch_size = 500;
    SET @rows_affected = 500;

    --Wasn't sure if this WHILE clause was still needed
    WHILE (@rows_affected = @delete_batch_size)
    BEGIN
    -- DELETE TOP (@delete_batch_size) snapshots.notable_query_plan
    -- FROM snapshots.notable_query_plan AS qp
    -- WHERE NOT EXISTS (
    -- SELECT snapshot_id
    -- FROM snapshots.query_stats AS qs
    -- WHERE qs.[sql_handle] = qp.[sql_handle] AND qs.plan_handle = qp.plan_handle
    -- AND qs.plan_generation_num = qp.plan_generation_num
    -- AND qs.statement_start_offset = qp.statement_start_offset
    -- AND qs.statement_end_offset = qp.statement_end_offset
    -- AND qs.creation_time = qp.creation_time);
    -- SET @rows_affected = @@ROWCOUNT;
    -- IF(@rows_affected > 0)
    -- BEGIN
    -- RAISERROR ('Deleted %d orphaned rows from snapshots.notable_query_plan', 0, 1, @rows_affected) WITH NOWAIT;
    -- END

    ------------------------------------------------------
    ----------------BEGIN NEW CODE
    -- Pasted from http://www.toddbaker.org/blog/2010/12/17/sql-2008-mdw-fixing-long-running-purges/
    ------------------------------------------------------

    select
    sql_handle,
    plan_handle,
    plan_generation_num,
    statement_start_offset,
    statement_end_offset,
    creation_time
    into #nqp
    FROM snapshots.notable_query_plan qp
    WHERE NOT EXISTS (
    SELECT *
    FROM snapshots.query_stats AS qs
    WHERE qs.[sql_handle] = qp.[sql_handle]
    AND qs.plan_handle = qp.plan_handle
    AND qs.plan_generation_num = qp.plan_generation_num
    AND qs.statement_start_offset = qp.statement_start_offset
    AND qs.statement_end_offset = qp.statement_end_offset
    AND qs.creation_time = qp.creation_time)

    SET @rows_affected=1

    WHILE (@rows_affected>0)
    BEGIN
    DELETE TOP (@delete_batch_size)
    FROM snapshots.notable_query_plan
    FROM #nqp n
    WHERE
    n.sql_handle=notable_query_plan.sql_handle
    AND notable_query_plan.plan_handle = n.plan_handle
    AND notable_query_plan.plan_generation_num=n.plan_generation_num
    AND notable_query_plan.statement_end_offset=n.statement_end_offset
    AND notable_query_plan.statement_start_offset=n.statement_start_offset
    AND notable_query_plan.creation_time=n.creation_time
    SET @rows_affected = @@ROWCOUNT;
    IF(@rows_affected > 0)
    BEGIN
    RAISERROR ('Deleted %d orphaned rows from snapshots.notable_query_plan', 0, 1,
    @rows_affected) WITH NOWAIT;
    END
    END
    drop table #nqp
    ------------------------------------------------------
    ----------------END NEW CODE
    ------------------------------------------------------

    -- Check if the execution of the stored proc exceeded the @duration specified
    IF (@duration IS NOT NULL)
    BEGIN
    IF (GETUTCDATE()>=@end_time)
    BEGIN
    PRINT 'Stopping purge. More than ' + CONVERT(NVARCHAR(MAX),@duration) + ' minutes passed since the start of operation.';
    BREAK
    END
    END

    -- Check if somebody wanted to stop the purge operation
    SELECT @stop_purge = COUNT(stop_purge) FROM [core].[purge_info_internal]
    IF (@stop_purge > 0)
    BEGIN
    PRINT 'Stopping purge. Detected a user request to stop purge.';
    BREAK
    END
    END;

    -- Do the same purge process for query text rows in the snapshots.notable_query_text table.
    SET @rows_affected = 500;
    WHILE (@rows_affected = @delete_batch_size)
    BEGIN
    DELETE TOP (@delete_batch_size) snapshots.notable_query_text
    FROM snapshots.notable_query_text AS qt
    WHERE NOT EXISTS (
    SELECT snapshot_id
    FROM snapshots.query_stats AS qs
    WHERE qs.[sql_handle] = qt.[sql_handle]);
    SET @rows_affected = @@ROWCOUNT;
    IF(@rows_affected > 0)
    BEGIN
    RAISERROR ('Deleted %d orphaned rows from snapshots.notable_query_text', 0, 1, @rows_affected) WITH NOWAIT;
    END

    -- Check if the execution of the stored proc exceeded the @duration specified
    IF (@duration IS NOT NULL)
    BEGIN
    IF (GETUTCDATE()>=@end_time)
    BEGIN
    PRINT 'Stopping purge. More than ' + CONVERT(NVARCHAR(MAX),@duration) + ' minutes passed since the start of operation.';
    BREAK
    END
    END

    -- Check if somebody wanted to stop the purge operation
    SELECT @stop_purge = COUNT(stop_purge) FROM [core].[purge_info_internal]
    IF (@stop_purge > 0)
    BEGIN
    PRINT 'Stopping purge. Detected a user request to stop purge.';
    BREAK
    END
    END;

    END


    GO

    • Proposed as answer by david33221 Monday, October 19, 2015 8:43 AM
    Wednesday, February 2, 2011 5:53 PM
  • Hello,

    We are working on fixing this issue in next cumulative update. Please let us know if you would be willing to test this fix on your test machine. I can send you the fixed T-SQL code.

    You can email me at sethu.srinivasan@microsoft.com

    Thanks

    Sethu Srinivasan

    SQL Server team

    http://blogs.msdn.com/sqlagent

    Wednesday, February 2, 2011 7:52 PM
  • Thanks for the great article. In my case I found that creating an index on the six fields used by the first delete-orphan-records query greatly improved the performance of the purge on my system.  It went from taking well over 12 hours to running in 5 seconds.

    CREATE NONCLUSTERED INDEX [idx_query_stats_sqlhandle2] ON [snapshots].[query_stats]
    ([sql_handle] ASC, [plan_handle] ASC, [plan_generation_num] ASC, [statement_start_offset] ASC, [statement_end_offset] ASC, [creation_time] ASC)
    WITH (FILLFACTOR = 75, PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
    Pat B
    • Proposed as answer by Naomi N Monday, January 2, 2012 7:45 PM
    Wednesday, April 6, 2011 12:24 PM
  • Pat B - your index really helped the "fixit" script from Microsoft Support article 971823. Thanks.
    Robert O'Donnell - Consultant
    Friday, June 24, 2011 8:07 PM
  • Glad to hear it has benefited others Robert!

     


    Pat B
    Friday, June 24, 2011 8:11 PM
  • FWIW, the recent security update for SQL Server (KB2494088) does something to the MDW database.  I have not really examined it but last weekend a customer installed this update on a server running UCP, and it killed their master database.  The solution was easy, but the errors were scary.  Anyway in our case the hotfix failed because it was trying to update something in the MDW database but "Agent XPs" were disabled.  After I manually enabled them and restarted SQL, the update completed successfully.

    More about that here:

    http://social.msdn.microsoft.com/Forums/en-US/sqlsecurity/thread/447354c8-56df-46b2-a198-abbbecb4b1ff 

    So what this tells me is that with KB2494088, Microsoft slipped something into the MDW database for this release.  Be aware that any customizations you might have made to the database could be overwritten.

    When you install the hotfix, it updates binaries, then runs a couple of scripts against the system databases.  The output of these scripts appears in the SQL Error logs, so if you're interested what has been changed, I'd look there. 
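    If you want to see exactly what the update scripts touched, something like this pulls the script output out of the current error log (the search strings here are guesses, so adjust them to what you find):

    ```sql
    -- Read the current SQL Server error log (0 = current log, 1 = SQL error log)
    -- and filter for lines mentioning database upgrade scripts
    EXEC sys.xp_readerrorlog 0, 1, N'database', N'script';
    ```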

     

    Friday, June 24, 2011 8:20 PM
  • I would recommend installing data collectors and Utility Control Point data warehousing on Enterprise Edition SQL so you can compress the snapshots.notable_query_plan table using page compression. We have been using data collectors in production for over a year to collect data on over 100 instances, and the system works well. The large-table issue is related to the snapshots.notable_query_plan table, which compresses down nicely using page compression. Row compression gets it down, but not nearly as much. If your IT dept does not have the budget for Enterprise, then create a purge job for the snapshots.notable_query_plan table to keep your data under control until Microsoft has a fix.
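    If you do have Enterprise Edition, estimating and then applying page compression on that table looks roughly like this (parameter names per sp_estimate_data_compression_savings in Books Online; the rebuild can take a long time on a table this size):

    ```sql
    USE [sysutility_mdw];
    GO
    -- Estimate what PAGE compression would save on the big plan table
    EXEC sp_estimate_data_compression_savings
        @schema_name      = N'snapshots',
        @object_name      = N'notable_query_plan',
        @index_id         = NULL,    -- all indexes
        @partition_number = NULL,    -- all partitions
        @data_compression = N'PAGE';
    GO
    -- Apply it (Enterprise Edition only; this rebuilds the whole table)
    ALTER TABLE snapshots.notable_query_plan
        REBUILD WITH (DATA_COMPRESSION = PAGE);
    ```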

    Thursday, August 25, 2011 2:44 PM
  • I have had some of the issues you describe. I set the original retention periods high on purpose to see how far I would get. It took weeks to delete enough data from the query-related internal tables.  I used this part of the Microsoft purge to get the query-related data cleaned up. My database is still totally out of control, even though I have reduced the monitored instances to two and reduced the retention periods across the board. 

    Here is what I used to free up space in the SSU database. Hope this is helpful. 

    USE [sysutility_mdw]
    GO

    DECLARE @ctr tinyint, @delete_batch_size int;
    SELECT @ctr = 1;
    SELECT @delete_batch_size = 500;

    WHILE @ctr <= 10
    BEGIN
        PRINT CAST(@ctr AS varchar(5))

        BEGIN TRANSACTION
        DELETE TOP (@delete_batch_size) snapshots.notable_query_plan
        FROM snapshots.notable_query_plan AS qp
        WHERE NOT EXISTS
           (
           SELECT snapshot_id
           FROM snapshots.query_stats AS qs
           WHERE qs.[sql_handle] = qp.[sql_handle]
             AND qs.plan_handle = qp.plan_handle
             AND qs.plan_generation_num = qp.plan_generation_num
             AND qs.statement_start_offset = qp.statement_start_offset
             AND qs.statement_end_offset = qp.statement_end_offset
             AND qs.creation_time = qp.creation_time
           );
        COMMIT;

        SELECT @ctr = @ctr + 1;
    END

     

    Monday, January 2, 2012 4:22 PM
  • I should have added *use with caution*, since this procedure is my own quick fix. Also, please note that, like another poster, I added indexes to expedite delete processing. 

     

    USE [sysutility_mdw]
    GO

    IF EXISTS (SELECT * FROM sys.indexes WHERE object_id = OBJECT_ID(N'[snapshots].[query_stats]') AND name = N'query_stats_ixtmp1')
        DROP INDEX [query_stats_ixtmp1] ON [snapshots].[query_stats] WITH (ONLINE = OFF)
    GO

    CREATE NONCLUSTERED INDEX [query_stats_ixtmp1] ON [snapshots].[query_stats]
    (
        [sql_handle] ASC,
        [plan_handle] ASC,
        [plan_generation_num] ASC,
        [statement_start_offset] ASC,
        [statement_end_offset] ASC,
        [creation_time] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    GO

    IF EXISTS (SELECT * FROM sys.indexes WHERE object_id = OBJECT_ID(N'[snapshots].[notable_query_plan]') AND name = N'notable_query_plan_ixtmp1')
        DROP INDEX [notable_query_plan_ixtmp1] ON [snapshots].[notable_query_plan] WITH (ONLINE = OFF)
    GO

    CREATE NONCLUSTERED INDEX [notable_query_plan_ixtmp1] ON [snapshots].[notable_query_plan]
    (
        [sql_handle] ASC,
        [plan_handle] ASC,
        [plan_generation_num] ASC,
        [statement_start_offset] ASC,
        [statement_end_offset] ASC,
        [creation_time] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    GO

    Monday, January 2, 2012 4:25 PM
  • I was just pondering this problem again today, as I had to allocate more space to my MDW database - again.  I've been having to do this more and more frequently, and I only have disk counters enabled.  I came across this info, which has basically resolved my problem with the purge not working.  It takes some time to run, but I've recovered several GB of disk space by deleting orphaned records as described in the article:

    FIX: The Management Data Warehouse database grows very large after you enable the data collector feature in SQL Server 2008; http://support.microsoft.com/kb/970014

    Thursday, October 11, 2012 8:56 PM
  • Pat's index solved my problem. The job was running 17-plus hours before I killed it. After the index, the job ran in under 90 seconds. Good fix, Pat.
    Thursday, January 3, 2013 8:13 PM
  • Dave, glad to hear my index worked for you. Simple indexes can render huge improvements in SQL Server performance.


    Pat B

    Friday, January 4, 2013 12:18 PM
  • Hi Naomi,

    I have this exec [core].[sp_purge_data] procedure which never finishes, and it leaves an open transaction in tempdb.

    Please advise: should I create the non-clustered index?

    Thanks


    Zahid

    Thursday, July 13, 2017 1:39 PM