Beantwortet Using Transactions in ETL

  • Mittwoch, 11. April 2012 03:44
     
     

    Hi all,

    I am currently implementing CDC to be used by my ETL packages. Because CDC will now capture capture changes, inserts and deletes - I have removed all logic in my SSIS packages that identifies inserts, changes and deletes.

    My concern now is, how I manage ETL failures e.g. if one of the last packages in my ETL process fails, I cannot simply re-run the entire ETL process again with the same source extracts, because the records that were succesfully inserted (before the fail) will be duplicated (or crash due to db constraints).

    My guess is to use SSIS transactions across all packages, so that if there is any fail - the entire ETL run is rolled back.

    I've setup a sucessful Proof of Concept (PoC) that consists of 4 packages, each with 2 Execute SQL Tasks - one using OLE DB  connection and the other using ADO.NET, and an execute Package Task to the next package. Simply by setting the parent package's Transaction Option to Required (all others are default of Supported) - my entire run is contained in a transaction - if any component fails, all changes are rolled back.

    When I attempt to implement the same on my ETL packages, my packages fail i.e. My Parent (Master) Package sucessfully acquires a connections  & setup / inserts audit logging, It call a second Parent package (Master of Dimensions) which calls all dimension related ETL packages in parallel - here my "Target ADO" connection fails. Error message: [Execute SQL Task] Error: Failed to acquire connection "Target ADO". Connection may not be configured correctly or you may not have the right permissions on this connection. Perhaps this part of my question is best answered in the SSIS forum...

    My question here is, am I on the right track to wrap the entire ETL process in a single transaction? What are some other ways I could manage 'mid-way ETL fail' situations? Checkpoints perhaps?

    Thanks to everyone who has got through this long winded question.

    Regards, Clay

Alle Antworten

  • Mittwoch, 11. April 2012 03:51
     
     

    Just my two cents...

    Checkpoints might help you restart the ETL from where it left; I would suggest to split the complete process in multiple packages and use transaction on a small level instead for the whole process.

    I have tried to implement this at starting it seems very convincing approach but it gets ugly when the data is huge and it takes a long time to rollback the transaction - also fills up the temp space on server while doing this.

    Better option instead of rollback is to have a Delete/Truncate clean up task if any of the package fail - it will be much faster than rolling back the transactions. I know that writing the sequential delete cleanup task will be complex but it will be worth it instead of relying on huge transactions rolling back.


  • Mittwoch, 11. April 2012 04:27
     
     

    Thanks Harshvai, not what I wanted to hear - but writing my own clean up makes a lot of sense. I'll leave this thread  open for other suggestions for a while....

    Cheers.

  • Mittwoch, 11. April 2012 04:43
     
     
    I know... my last experience with this was horrible - we were supposed to dump data into a database owned by another team which had tables with lots of indexes and in very bad shape and also had triggers for audit - when the load failed it started to rollback transaction and in mid way it crashed - we were left in between and the only way was to restore the database and restart...
  • Mittwoch, 11. April 2012 14:25
     
     Beantwortet

    I use this approach succesfully in an ongoin basis. the key is to implement sequence containers an set the transaction option as required for the container that performs the insert in your destination as well as the insert in your control table. Assuming you are using one. In this way if there is a failure the package will read from the control table the last LSN used in a succesfull run to pass it as a parameter to the CDC function that retrieves data. in a nutshell as long as the insert to the destination and the insert on the control table you used to keep track of your LSN are rolled back you can always resume processing on the next run. Enroll only the minimum number or operations in a distributed transaction to avoid problems.

    Hope this helps,

    Luis @luisefigueroa

    • Als Antwort markiert clay123123123 Donnerstag, 12. April 2012 22:57
    •  
  • Donnerstag, 12. April 2012 00:50
     
      Enthält Code

    Thanks for the reply Luis,

    This sounds like an elegant approach, though I'm not sure what you mean by control table - I've always used Audit Logging which has always sufficed prior to the implemntation of CDC. I've extended this logging to included LSN info, but i think it must fall short of the functionality offered your 'control table'.

    One of my auditing (package logging) tables looks like this:

    CREATE TABLE [dbo].[AuditPkgExecution](
    	[PkgExecKey] [int] IDENTITY(1,1) NOT NULL,
    	[PkgName] [varchar](50) NULL,
    	[PkgGUID] [uniqueidentifier] NULL,
    	[PkgVersionGUID] [uniqueidentifier] NULL,
    	[PkgVersionMajor] [smallint] NULL,
    	[PkgVersionMinor] [smallint] NULL,
    	[ExecStartDT] [datetime] NULL,
    	[ExecStopDT] [datetime] NULL,
    	[SuccessfulProcessingInd] [char](1) NULL,
    	[ExtractStartLSN] [nvarchar](42) NULL,
    	[ExtractEndLSN] [nvarchar](42) NULL,
    	[ParentPkgExecKey] [int] NULL,
     CONSTRAINT [PK_dbo.AuditPkgExecution] PRIMARY KEY CLUSTERED 
    (
    	[PkgExecKey] ASC
    )

    The ExtractStartLSN and ExtractEndLSN are obtained and logged for the master package and passed through to all other packages through parent/child configs i.e. all packages use the same low and high LSN's for their extraction. Therefore, if a few my packages run succesfully and a subsequent package fails - i could not re-run the entire ETL process because of the duplicate issues already discussed. I suspect your control table alleviates this issue somehow  - are there any resources on the web you may be able to point me to which explains the use of a 'control table'

    Appreciate your time.

    Thanks. Clay

  • Donnerstag, 12. April 2012 16:08
     
     Beantwortet

    I use a "Control table" just to store any values I use as a baseline between ETL Runs. For example, in this case you need to pass a start lsn and end lsn parameter to retrieve CDC data. What I would do is insert on this control table the parameters used on every run. In that way the next time the ETL runs It can read from there what was the last end lsn succesfully used and use it as the starting point of the next run. The advantage is, if the transaction fails the control table does not get updated. It is a way to implement recoverability of the ETL after a package failure. You can also accomplish this using check points but I prefer the use of the control table.

    • Als Antwort markiert clay123123123 Donnerstag, 12. April 2012 22:57
    •  
  • Donnerstag, 12. April 2012 22:57
     
     

    Thanks Luis. I just need to obtain and log my start and end lsns for each package (that uses CDC functions) instead of once for the entire ETL run.

    Cheers.