locked
Best way to replace existing code depends on DB snapshot RRS feed

  • Question

  • We inherited some ETL processes that use SQL Server database snapshots as the main error handling tool for some long ETL processes (load data into large Data Warehouse databases).   Each process creates a DB snapshot at the beginning of the process, it either rolls back to this DB snapshot on error or drop the DB snapshot on successful completion of the job.  

    These are long-running processes (runs for 3+ hours) and the underlying DB is >3TB in size.

    Since DB Snapshot is not supported in most of the Azure Database offerings and we can’t automate such DB Snapshot restore in setups using SQL AlwaysOn Availability Group (DBA team has to take the DB out of the Availability Group before restoring the DB snapshot)  We want to redesign these processes to remove the dependencies on DB snapshots.   I would say the current processes (daily job) roll back to the DB snapshot maybe once every 3 months (80% time due to the bad data coming from the source system, 20% time due to networking issues during the middle of the job)   So rollback does not happen very often.

    Here are the challenges:

    - These ETL processes are too big to put into a single transaction, especially existing SSIS package invoke multiple control flows & data flows in parallel. 

    -  Depending on the type of error, there are times it can very difficult to resume of ETL process from the point of failure.   For example:  We have steps that truncate & reload tables and  sometimes run into errors after target table has been truncated & changed.    There are cases where it will be much easier to restore to the original DB snapshot, fix the problem (or disable a step in the process) and start everything from the beginning. 

    -  The size of underlying database is too big for us to make a DB backup at the beginning of the process (it will take 2~3 hours to just create that DB backup) where DB snapshots can be created instantaneously.    The company can’t afford to add another 2~3 hours to these already long ETL processes every day. 




    • Edited by zw2006 Thursday, February 6, 2020 5:07 PM
    Thursday, February 6, 2020 5:04 PM

All replies

  • Rather than trucating the target table directly, you could fill up a new table with the data, and once it has been completed, you use ALTER TABLE SWTICH to swap the tables. If you have tables that are related, you can switch them all in the same transaction. The switch operation is a metadata operation, so it is quick as long as it is not being blocked by running queries.

    When it comes to backup, I do hope that you backup the database every once in a while in one way or another. If you are in full recovery, you can also do point-in-time restores. Then again, if you are in full recovery, the load operations may take longer time. But if you are in bulk-logged recovery, you can backup the database some hours before the load starts, and then backup the log just before the load starts, so that you can recover that way.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Thursday, February 6, 2020 10:31 PM
  • Hello,

    Maybe this is an immature idea that could you create DB snapshot manually before the beginning of ETL process.? Then you don't have wait another 2~3 hours for ETL process?  

    Here is an article about create DB snapshot using T-SQL:https://docs.microsoft.com/en-us/sql/relational-databases/databases/create-a-database-snapshot-transact-sql?view=sql-server-ver15


    MSDN Community Support
    Please remember to click "Mark as Answer" the responses that resolved your issue, and to click "Unmark as Answer" if not. This can be beneficial to other community members reading this thread. If you have any compliments or complaints to MSDN Support, feel free to contact MSDNFSF@microsoft.com.

    Friday, February 7, 2020 9:57 AM
  • We are already using T-SQL to create DB snapshot before the ETL process (then either drop it or restore it back)    The question is about how to rewrite the existing logic so we don't have to use DB snapshot for the same process.

    Currently the underlying DB is in Simple Recovery mode.   It's on weekly full DB backup & daily differential backup schedule.      I think changing the database to full recovery mode and do point-in-time restore (mentioned by Erland) may be one of the option we will look into.     Hopefully we can trigger that point-in-time DB restore automatically from the ETL process (do not need to wake up DBA at night to have them do a DB restore at night for us) 


    Friday, February 7, 2020 6:32 PM
  • Currently the underlying DB is in Simple Recovery mode.   It's on weekly full DB backup & daily differential backup schedule.      I think changing the database to full recovery mode and do point-in-time restore (mentioned by Erland) may be one of the option we will look into.     Hopefully we can trigger that point-in-time DB restore automatically from the ETL process (do not need to wake up DBA at night to have them do a DB restore at night for us) 

    Note that switching to full recovery means that you cannot take benefit of minimally logged transactions, which I would assume that your load process does today. Thus, will full recovery you may see longer load times - and quite a strain on the transaction log.

    Then again, in your original post, you talk about Availability Groups, and if you put the database in an AG, you have no option but full recovery.

    Even if the restores you talk about don't happen very often, they appears to happen enough that a RESTORE is not really an option. If the backup takes several hours, so does the restore, and then you need to rebuild the AG. Do you really want to do that every three months?

    Did you consider the idea with ALTER TABLE SWITCH that I suggested?


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Friday, February 7, 2020 10:18 PM
  • I will look into the Alter table switch.    Truncation/reload table is the just one of the reason why we rely on DB snapshots.     There other instances we update data in multiple tables in a loop  so a lot of  existing code need to be re-designed in order to support error recovery for those process without using DB snapshot.

    Anyway,  I was hoping to find some solutions using non-SQL tools, such as: putting the all the databases files (data & log)  need DB snapshot on a single disk drive on an Azure VM.   Then have the ETL process take a disk snapshot of that entire disk drive before the process starts (maybe that's possible with Azure managed disk?)  and the ETL process can restore all the data (from the disk snapshot) back to that drive  if something fails...        Just not sure if such idea is really doable in Azure.   

       

    • Edited by zw2006 Monday, February 10, 2020 5:42 PM
    Monday, February 10, 2020 5:41 PM
  • Ok, if you are also updating tables, ALTER TABLE SWITCH is not the sole solution.

    I don't know that much about Azure that I can say whether your idea is possible. But given that there is an AG involved, I can see some problems with reverting to the disk snapshot.

    Then again, a possible idea with an AG, is take down on replica when the operation starts, and in case of error fail over to the replica in that state. I'm not sure that this will actually work, though.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Monday, February 10, 2020 10:30 PM