Daily purge affecting business

  • Question

  • We're using SQL Server 2008 to host an application that logs certain events to a table including columns such as...

    MessageId INT (PK)
    LogDateTime DATETIME
    others...

    While these records can be updated, they're mostly just written and then read later on.

    Once every 24 hours a purge sproc is run to delete "old" data.  This purge is done with code similar to....

    DELETE FROM Messages
    WHERE DATEDIFF(DAY, Messages.LogDateTime, GETDATE()) > 30

    While that approach works, we're finding it's not very efficient.  The application logs about 200,000 records a day, so roughly that same number of records is deleted every night.  The purge can take tens of seconds to complete, and during that time the application is not able to add records, presumably because the DELETE has the table locked.

    We don't really care how long that purge process takes, but we can't have it keeping the application from adding new records.

    It's been suggested we index the LogDateTime column, but we're reluctant to do that, thinking the application's throughput of adding records would take a hit with all of the indexing taking place.

    Another thought is to avoid the DATEDIFF math by finding the newest record that should be deleted and using its MessageId in a WHERE clause, something like DELETE FROM...  WHERE MessageId < 12345.

    Should the purge process use a single DELETE, or break the work into several DELETEs, allowing the application access to the table between them?

    Any thoughts?

    Thanks!  -- Curt

    Friday, July 8, 2016 12:50 PM


All replies

  • Hi,

    I'd recommend splitting the huge delete into smaller chunks, for example with a cursor or with a loop like this:

    WHILE EXISTS (SELECT * FROM Table WHERE Condition)
        DELETE TOP (100) FROM Table WHERE Condition;
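
    A slightly fuller sketch of that idea (assuming the Messages table and LogDateTime column from the question; the batch size and delay are arbitrary illustrations):

    -- Sketch only: batch the purge so each DELETE holds its locks only briefly.
    DECLARE @Boundary DATETIME;
    SET @Boundary = DATEADD(DAY, -30, GETDATE());

    WHILE 1 = 1
    BEGIN
        DELETE TOP (5000) FROM dbo.Messages
        WHERE LogDateTime < @Boundary;

        IF @@ROWCOUNT = 0 BREAK;      -- nothing older than the boundary remains

        WAITFOR DELAY '00:00:01';     -- brief pause so concurrent inserts get a turn
    END;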

    Friday, July 8, 2016 1:15 PM
  • WHERE DATEDIFF(DAY, Messages.LogDateTime, GETDATE()) > 30

    This type of logic will always be inefficient because it requires the engine to calculate a value from the stored values, which prevents it from using potentially useful indexes (which apparently you don't have).  So rather than calculating the difference between <now> and LogDateTime, you should calculate the datetime boundary that separates the rows you want to keep from those you want to remove, and delete everything earlier. E.g.,

    where Messages.LogDateTime < dateadd(day, -30, cast(GETDATE() as date))

    Batching the deletes, as Sebastian suggests, sounds promising.  Indexing the LogDateTime column in some fashion will also likely help.  I don't think you will experience a significant performance impact with just 200K rows per day, though you will see some growth in file space.  But there is only one way to know for sure: try it.  And since we're guessing: is MessageId an identity column?  Is LogDateTime specified during the insert, or does it default to GETDATE()?  Is either column ever updated?  Depending on the answers, you could argue that the clustered index should include LogDateTime.
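
    Putting the two suggestions together, a rough sketch (table and column names taken from the question; the index name is just illustrative):

    -- Sketch: a narrow nonclustered index to support the purge predicate...
    CREATE NONCLUSTERED INDEX IX_Messages_LogDateTime
        ON dbo.Messages (LogDateTime);

    -- ...and the purge rewritten with a precomputed, sargable boundary.
    DECLARE @Boundary DATETIME;
    SET @Boundary = DATEADD(DAY, -30, CAST(GETDATE() AS DATE));

    DELETE FROM dbo.Messages
    WHERE LogDateTime < @Boundary;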

    Friday, July 8, 2016 2:31 PM
  • Yes, MessageId is an identity column.  Starts at 1, auto-increments...

    Records are inserted inside an sproc, which includes a call to GETDATE() for the LogDateTime field.

    Neither column is ever updated.

    >> you can argue that the clustered index should include LogDateTime

    Interesting.  So, although there can be only one clustered index, it can include multiple columns?

    Friday, July 8, 2016 3:27 PM
  • Yes, a clustered index can use multiple columns.
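
    For illustration only (this sketch assumes the existing primary key on MessageId would be made nonclustered; the index name is hypothetical):

    -- Sketch: a composite clustered index keyed on LogDateTime, with MessageId
    -- appended to make the key unique.
    CREATE UNIQUE CLUSTERED INDEX CIX_Messages_LogDateTime_MessageId
        ON dbo.Messages (LogDateTime, MessageId);
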
    Friday, July 8, 2016 4:12 PM
  • Hi CurtisDeHaven,

    For this case, you may consider using a partitioned table instead.
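
    A minimal sketch of that idea (hypothetical names, monthly partitions on LogDateTime; note that table partitioning requires Enterprise Edition on SQL Server 2008):

    -- Sketch: monthly partitions so old data can be switched out rather than deleted row by row.
    CREATE PARTITION FUNCTION pf_MessagesByMonth (DATETIME)
        AS RANGE RIGHT FOR VALUES ('20160501', '20160601', '20160701');

    CREATE PARTITION SCHEME ps_MessagesByMonth
        AS PARTITION pf_MessagesByMonth ALL TO ([PRIMARY]);

    -- The table (or its clustered index) would be rebuilt on ps_MessagesByMonth(LogDateTime).
    -- The purge then becomes a metadata-only operation against a staging table that has the
    -- same structure and sits on the same filegroup:
    -- ALTER TABLE dbo.Messages SWITCH PARTITION 1 TO dbo.Messages_Purge;
    -- TRUNCATE TABLE dbo.Messages_Purge;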


    Sam Zha
    TechNet Community Support

    Saturday, July 9, 2016 2:29 PM
    Moderator
  • Hi CurtisDeHaven,

    While data is being deleted from the table, you want to make sure the table still accepts inserts, right? If so, why not consider the SNAPSHOT transaction isolation level? Or you can set the READ_COMMITTED_SNAPSHOT option at the database level.

    As for the calculation within the WHERE clause, you can pre-calculate the date boundary.

    Here is a sample:

    SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
    GO

    DECLARE @DateBoundary DATETIME = DATEADD(DAY, -30, GETDATE());

    DELETE A
    FROM dbo.Table_1 AS A
    WHERE A.LogDateTime < @DateBoundary;
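
    For completeness, here is a sketch of the database-level settings this sample assumes (YourDatabase is a placeholder for the actual database name):

    -- Required before sessions can use the SNAPSHOT isolation level.
    ALTER DATABASE YourDatabase SET ALLOW_SNAPSHOT_ISOLATION ON;

    -- Alternative: make the default READ COMMITTED level use row versioning instead.
    -- This statement needs (near-)exclusive access to the database to complete.
    ALTER DATABASE YourDatabase SET READ_COMMITTED_SNAPSHOT ON;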

    You can refer to this article for details about transaction isolation levels:

    https://msdn.microsoft.com/en-us/library/ms173763.aspx?f=255&MSPPError=-2147217396

    Please feel free to let me know if I have misunderstood anything.

    Regards,

    Albert

    Saturday, July 9, 2016 8:18 PM
  • It's been suggested we index the LogDateTime column, but we're reluctant to do that, thinking the application's throughput of adding records would take a hit with all of the indexing taking place.

    I wouldn't expect the additional index on LogDateTime to have a significant performance impact with only 200K inserts a day.  The index key is incremental, so fragmentation due to index page splits should be virtually nonexistent.

    The index will not only help the performance of the daily delete, but also greatly improve concurrency during the delete, since only the needed rows are touched (about 3% of the rows instead of all of them).  This assumes you take the advice Scott suggested so that the expression is sargable.
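
    If you want to verify that once the index has been in place for a while, here is a quick check (a sketch, using the Messages table name from the question):

    -- Sketch: report fragmentation and size for all indexes on dbo.Messages.
    SELECT i.name AS index_name,
           ips.avg_fragmentation_in_percent,
           ips.page_count
    FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Messages'), NULL, NULL, 'LIMITED') AS ips
    JOIN sys.indexes AS i
        ON i.object_id = ips.object_id
       AND i.index_id = ips.index_id;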


    Dan Guzman, Data Platform MVP, http://www.dbdelta.com

    Saturday, July 9, 2016 10:01 PM
  • /* Can you try this */


    DECLARE @FromDate DATETIME;
    SET @FromDate = DATEADD(DAY, -30, GETDATE());

    -- Cache the MessageId(s) that need to get deleted.
    SELECT MessageId
    INTO #DEL_MessageIds
    FROM [Messages]
    WHERE LogDateTime < @FromDate;

    -- Then apply the DELETE based on MessageId.
    DELETE m
    FROM [Messages] m
    JOIN #DEL_MessageIds del_m
        ON m.MessageId = del_m.MessageId;



    rajivkumar.bala@yahoo.co.in

    Saturday, July 9, 2016 11:23 PM