Best method for "UPSERTS"?

  • Question

  • Hello,

    I'm trying to do some "update if exists otherwise insert" logic in my SSIS package but am running into problems.  I researched and found that one way to do it is with a Lookup transform, redirecting the error output.  But this causes problems, as I run out of memory (I assume) because random components start failing for no reason.  My reference table also has over 2 million rows.

    Any ideas on best method of doing upserts?

    Tuesday, August 30, 2011 7:38 PM

Answers

  • My best experiences have been with the method it sounds like John proposed earlier in the thread:

    • Combine new data source with existing data by using a Merge Join that does a Left Outer Join (remember that both sources must be Sorted in a Merge Join). 
    • In the Data Source, when you SELECT existing data, add an "Update Flag" column (set to True, 1, whatever). 
    • In the Data Flow, following the Merge Join, all the New records will have Update Flag = NULL, all existing will have True/1. 
    • Continue processing. You can split the data flow at any time using a Conditional Split to check the value of the Update Flag column.  You should at least branch at the end of the Data Flow, inserting the New records directly to the target table.  Existing records you insert into a work table.
    • Having handled Inserts, you get back out into the Control Flow and use an Execute SQL task containing a SQL statement that joins the existing data with the work table to do your Updates.

    If you are on SQL 2008, of course, the MERGE can be a much slicker option.  I think it performs better, as well.
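
    For example, a minimal sketch of the existing-data source query described in the second bullet above (the table, key, and flag names here are placeholders, not anything from this thread):

    -- hypothetical: existing data, flagged and sorted for the Merge Join
    SELECT Key1, Col1, CAST(1 AS BIT) AS UpdateFlag
    FROM dbo.PrimaryTable
    ORDER BY Key1;   -- both Merge Join inputs must be sorted on the join key

    After the Left Outer Join (new data on the left), rows where UpdateFlag is NULL are the Inserts; the rest go to the work table for the set-based Update.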

     

    • Proposed as answer by Eileen Zhao Sunday, September 4, 2011 5:43 AM
    • Marked as answer by Ashish Khatri Monday, November 21, 2011 5:05 PM
    Thursday, September 1, 2011 7:30 PM

All replies

  • See:

    Get all from Table A that isn't in Table B

    http://www.sqlis.com/sqlis/post/Get-all-from-Table-A-that-isnt-in-Table-B.aspx


    Arthur My Blog
    Tuesday, August 30, 2011 7:41 PM
    Moderator
  • I tried method 2 in your link, but it makes unrelated components fail from, I assume, lack of memory.  I think this because everything runs fine after removing the Lookup.
    Tuesday, August 30, 2011 7:53 PM
  • My preference is to write the data to a working table, and use a MERGE statement to perform an upsert. Of course, this assumes that the target of your package supports a MERGE or similar command. You can use something like the Batch Destination or the Merge Destination to facilitate this in the data flow.
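
    As a rough illustration of that pattern (the work table, target table, and column names below are assumptions, not from this thread), the post-load statement might look like:

    -- hypothetical: the data flow has already bulk-loaded dbo.WorkTable
    MERGE INTO dbo.PrimaryTable AS Target
    USING dbo.WorkTable AS Source
        ON Target.Key1 = Source.Key1
    WHEN MATCHED THEN
        UPDATE SET Col1 = Source.Col1
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (Key1, Col1) VALUES (Source.Key1, Source.Col1);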
    John Welch | www.varigence.com | www.agilebi.com | ssisUnit.codeplex.com
    Tuesday, August 30, 2011 8:06 PM
    Moderator
  • Your approach with the Lookup is correct. It should not fail unless you are running really low on memory. I would like to see the error you are getting; that will tell us whether it is indeed a memory issue or not. My guess is you have a data issue rather than a memory one.

    Another option is to use the Slowly Changing Dimension transform, though I would not recommend that due to its performance issues.

    Tuesday, August 30, 2011 9:21 PM
  • MERGE would be the best solution for an upsert; it performs well for bulk data sets. Put in an Execute SQL Task component and implement a modified version of the code below:

     

    MERGE INTO Sales.SalesReason AS Target
    USING (VALUES ('Recommendation','Other'), ('Review', 'Marketing'), ('Internet', 'Promotion'))
           AS Source (NewName, NewReasonType)
    ON Target.Name = Source.NewName
    WHEN MATCHED THEN
     UPDATE SET ReasonType = Source.NewReasonType
    WHEN NOT MATCHED BY TARGET THEN
     INSERT (Name, ReasonType) VALUES (Source.NewName, Source.NewReasonType);  -- MERGE must be terminated with a semicolon

    Regards,

    Vipin Nair

    Wednesday, August 31, 2011 2:44 AM
  • Beware that the TSQL MERGE statement is only available from SQL Server 2008 and up.

    If you are working with SSIS 2005, you'll need to use Lookups or a TSQL LEFT OUTER JOIN to determine inserts and updates.
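
    For illustration, a minimal sketch of that LEFT OUTER JOIN approach (the staging and target table names here are assumptions):

    -- hypothetical: rows with no match in the target are the Inserts
    SELECT s.Key1, s.Col1
    FROM dbo.StagingTable s
    LEFT OUTER JOIN dbo.TargetTable t
        ON t.Key1 = s.Key1
    WHERE t.Key1 IS NULL;
    -- matched rows (t.Key1 IS NOT NULL) are the candidates for an UPDATE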

    Whatever you do, do not use the SCD component; it is notoriously slow.


    MCTS, MCITP - Please mark posts as answered where appropriate.
    Answer #1: Have you tried turning it off and on again?
    Answer #2: It depends...
    Wednesday, August 31, 2011 1:06 PM
  • Why are people encouraging MERGE statements? I thought the benefit of using SSIS was that you could do this lookup and insert/update in one pass. Try setting the cache option to "no cache" and let us know if you still get this error.

    As an alternative to the 'Slowly Changing Dimension' (SCD) task you can download the Kimball SCD component from CodePlex, or work with a conditional split: http://consultingblogs.emc.com/jamiethomson/archive/2006/09/12/SSIS_3A00_-Checking-if-a-row-exists-and-if-it-does_2C00_-has-it-changed.aspx

     

    For the experts: which is faster, the MERGE or the Lookup?

    Wednesday, August 31, 2011 2:32 PM
  • MERGE and lookup are 2 different things. Comparing them is apples and oranges. In this scenario, you'd use the MERGE statement to determine whether the row should be updated or inserted, which negates the need for the lookup.

    If you take the lookup approach, remember that the OLE DB Destination only does inserts. If you don't use the MERGE, you must use an OLE DB Destination to do the inserts, and either a) use an OLE DB Command to perform the update, which is slow since it sends row-by-row UPDATE statements to the database, or b) write the "update" data to a working table in the data flow, and use an Execute SQL task after the data flow to perform a set-based UPDATE.

    Using MERGE gives you a cleaner data flow, since you only have one destination (the working table or MERGE-capable destination component).

    Is it faster? My experience is that MERGE is faster than the OLE DB Command, unless you have a very small number of updates. It's also usually faster than a Lookup in No Cache mode, since that also becomes row-by-row processing against the database. However, as always, performance is very dependent on the database, so your mileage may vary.

    I do use the Lookup approach regularly as well - MERGE isn't an option on older versions of SQL Server. Different patterns apply in different situations.


    John Welch | www.varigence.com | www.agilebi.com | ssisUnit.codeplex.com
    Wednesday, August 31, 2011 2:49 PM
    Moderator
  • I am using 2005 so Merge is not an option. 

    I ran the package today and no errors showed up, but it ran extremely slowly, and I saw it was paging about 2.5 GB to disk on top of the 1.5 GB of RAM it was using, so I know it was a memory problem.  I'll use johnwelch's approach of using a working table and then a set-based update and see if that speeds it up considerably.  Currently I am just using a row-by-row OLE DB Command update for the "success" output of the Lookup.


    Wednesday, August 31, 2011 3:16 PM
  • Hmm, it's still not functioning properly.  It stalls out at the update statement.  I'm thinking my update statement may be wrong.

    I've set the Lookup to "use table or view" for the table that I'm upserting to.  This table has a 5-column key, so on the Columns page I've checked those 5 columns.  Then I redirect the error output to the table I'm upserting to and send the normal output to a working table.  Then I do a set-based update in an Execute SQL task after the Data Flow task is complete:

     

    UPDATE p
    SET p.Col1 = t.Col1,
        p.Col2 = t.Col2,
        p.Col3 = t.Col3,
        p.Col4 = t.Col4
    FROM dbo.PrimaryTable p
    INNER JOIN dbo.TempTable t
        ON p.Key1 = t.Key1
       AND p.Key2 = t.Key2
       AND p.Key3 = t.Key3
       AND p.Key4 = t.Key4
       AND p.Key5 = t.Key5;

     


    Wednesday, August 31, 2011 6:55 PM
  • There is a way to do this with limited caching. Set up two OLE DB sources, one for each table you are comparing, and sort them in SQL using your comparison key. In the Advanced Editor, set "IsSorted" to True on the output and give the key column a "SortKeyPosition" of 1 (repeat for the other source). Then use a Merge Join component set up as a left outer join; this will give you an output with all the input data on the left and matched rows on the right, while unmatched rows will return NULLs on the right. Split off the unmatched rows and send them to a fast-load destination. Send the remaining rows to an OLE DB Command for the update. Because the Merge Join relies on two sorted flows it does not require an asynchronous cache and is therefore less memory hungry.
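
    A minimal sketch of the kind of sorted source query this relies on (the table and column names are assumptions); the ORDER BY must match the SortKeyPosition values set in the Advanced Editor:

    -- hypothetical source query used by each OLE DB Source
    SELECT Key1, Col1, Col2
    FROM dbo.SourceTable
    ORDER BY Key1;   -- Key1 has SortKeyPosition = 1 on the output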

     

    Hope this helps

     

    Tim

    Wednesday, August 31, 2011 8:36 PM
  • Hi Ashish,

    We have tried upserts where we had to compare against more than 15 million rows, which we were able to do with limited memory available. Please check that you are selecting only the required columns in the Lookup (the ID columns only) which will be joined. If you pull all the columns it will definitely take more space in memory; also check options like partial cache for the Lookup.
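
    For example, a trimmed lookup query (the table and key names here are just placeholders) would return nothing but the join keys:

    -- hypothetical: return only the columns the Lookup actually joins on
    SELECT Key1, Key2, Key3, Key4, Key5
    FROM dbo.PrimaryTable;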

    Wednesday, August 31, 2011 8:51 PM
  • Hi John,

     

    Sorry for the lack of explanation, but I meant source, lookups, SCD, insert & update. Maybe it's because I've never seen a suitable situation where I could use a MERGE statement, but I like the fact that you can extract your data, do calculations and FK lookups, and update/insert the destination with SCD type 1 or 2. I have seen a situation where they used MERGE statements, but that "ETL" process consisted only of SQL tasks & script tasks; yes, data flows were used, but only 1-to-1 to load data from the source and store it in staging. This leads to a situation where you copy the data three times into different tables: once into staging, once with the join to determine the FKs, and once with the upsert statement. It works, I agree, but I always wonder if this is a good approach; it looks messy, is difficult to maintain, row-based error handling is not possible, redirection of unknown lookups, ...

    Maybe it's because I'm used to working with a DWH.

     

    Thanks in advance.

    Thursday, September 1, 2011 6:42 AM
  • To avoid updating identical records, you could use a Conditional Split. Use both your source and your destination as sources, apply a Sort to each and sort them on your logical keys, then use a merge join followed by your Conditional Split. The first condition checks whether the destination PK is NULL; that flow goes to the insert OLE DB Destination. Next you compare the source fields with the destination fields, e.g. src.col1 != des.col1 || src.col2 != des.col2 ...; that flow you can redirect to the update OLE DB Command.

    I have found a screenshot on point 5 of the following link: http://www.resultdata.com/Company/News/Newsletter/2009/Jul/Articles/SQL/Merging%20Data%20in%20a%20SQL%20Server%20Integration%20Services%20Package%20Data%20Flow.htm

     

    hope this helps.

    Thursday, September 1, 2011 7:07 AM
  • Most of my work is in data warehousing as well. In cases where I use MERGE, I use the data flow for most of the data processing, so I still get row based error handling and other data flow benefits, except on the actual insert / update of the data. But I check all the constraints ahead of time in the data flow, so it is extremely rare to have a problem inserting or updating.

    Again, this is a pattern that applies to some situations but not to others. I've seen it deliver some solid performance increases in a number of situations. In others, it's on par with a typical approach. But one of the advantages of SSIS (and a disadvantage, in some cases) is that there are multiple ways to do almost anything. If one way isn't delivering what you need, you can try a different approach.


    John Welch | www.varigence.com | www.agilebi.com | ssisUnit.codeplex.com
    Thursday, September 1, 2011 1:49 PM
    Moderator
  • I used a SQL procedure with a cursor and an IF EXISTS() clause. It worked fine for me.
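
    For reference, a minimal sketch of that per-row IF EXISTS pattern (the procedure, table, and column names are assumptions; note the later replies' caveat that cursor-based row-by-row processing is usually much slower than a set-based approach):

    -- hypothetical per-row upsert; a cursor would call this once per incoming row
    CREATE PROCEDURE dbo.UpsertPrimaryRow
        @Key1 INT,
        @Col1 VARCHAR(50)
    AS
    BEGIN
        IF EXISTS (SELECT 1 FROM dbo.PrimaryTable WHERE Key1 = @Key1)
            UPDATE dbo.PrimaryTable SET Col1 = @Col1 WHERE Key1 = @Key1
        ELSE
            INSERT INTO dbo.PrimaryTable (Key1, Col1) VALUES (@Key1, @Col1)
    END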

     

    best regards,

    Sengwa


    Best regards, Sengwa
    Thursday, September 1, 2011 2:19 PM
  • I used a SQL procedure with a cursor and an IF EXISTS() clause. It worked fine for me.

     

    best regards,

    Sengwa


    Best regards, Sengwa

    A cursor? I think a little bit of me just died :)
    MCTS, MCITP - Please mark posts as answered where appropriate.
    Answer #1: Have you tried turning it off and on again?
    Answer #2: It depends...
    Thursday, September 1, 2011 6:23 PM
  • I agree. This totally defeats the point of using an in-memory ETL tool.
    Friday, March 27, 2015 8:33 PM