Checking to see if a record exists and if so update else insert
- I've decided to post this as a sticky given the frequency this question is asked.
For those of you wishing to build a package that determines if a source row exists in the destination and if so update it else insert it, this link is for you.
http://blogs.conchango.com/jamiethomson/archive/2006/09/12/SSIS_3A00_-Checking-if-a-row-exists-and-if-it-does_2C00_-has-it-changed.aspx
Thanks Jamie!
If you want to do a similar concept to Jamie's blog post above, but with the Konesan's Checksum Transformation to quickly compare MANY fields, you can visit here:
http://www.ssistalk.com/2007/03/09/ssis-using-a-checksum-to-determine-if-a-row-has-changed/
Phil
All Replies
- This is exactly what I was looking for.
I appreciate that.
Fahad - It's an awesome article - and I used this method in my 1st SSIS package.
However, if number of records is high (over 1000) the OLE Update Command performs slowly. So in that case you might better off loading source data to staging table and creating Stored Procedure which Updates then Inserts (and Deletes) data in the destination table.
Using Lookup transform in the way described in article is quite efficient (compared to other methods like merge join) - and can be useful in a variety of situations - TheViewMaster,
Yes, very valid points. Each developer will have to decide based on the pros and cons of each method.
Personally, I wouldn't go to the extreme that you did, but would rather dump just the updates to a staging table and then outside of the data flow use an Execute SQL task to perform the batch update. I don't want to spend the cost to insert the data twice (once into staging table, and then again into the destination table using your idea). - Can somebody point me to a simple example of this.
I have a flat file and I want to check if the record exist in the file exist in the table, if it does I want to update and if it does not I want to insert. This is very helpful. Thanks!
I am completely new to the SQL Server Integration Services. I went through your link and understand how you filter out diffrent records. Could you pointting out how you actually update you existing records ?
Thanks!
- I have cloned the example above and it does not seem to be picking up the new or changed records. Is there any way to troubleshoot this.
- I use OLE DB Command update the records. Thanks!
Hi phill,
Great article, I loved it.
This is what i had been looking for since long time.
One doubt is there in my mind, my current assignment i have to transform about 40K rows from a database.
Will this method work for me or there is any other better way.. ?
regards
Sudhir Kesharwani
Sudhir Kesharwani wrote: Hi phill,
Great article, I loved it.
This is what i had been looking for since long time.
One doubt is there in my mind, my current assignment i have to transform about 40K rows from a database.
Will this method work for me or there is any other better way.. ?
regards
Sudhir Kesharwani
You need to likely send your changed records to a temporary table and then perform a batch update using an Execute SQL Task on the control flow. Using the OLE DB Command transformation isn't very efficient for high numbers of rows.Phil,
As we discussed about updating/inserting techniques a while ago(see http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=1211321&SiteID=1) you told me SSIS can perform this operation much faster than using plain SQL... I put the emphasize on the "much faster".I have created a performance test package comparing the 2 methods:
1) stored proc with UPDATE and INSERT statements
2) Data flow using look up/conditional splitI have a data flow task to create a data test set with a variable number of rows (I used Jamie's script component source example) with IDs (used as PK) always ranging from 1 to nbrRows. Each row is constituted of 5 int, 4 string (10, 20, 40 and 80 characters) and 1 datetime with values chosen randomly.
So what I do is calling the package with nbrRows=1, 10, 100... The first time 1 new row is added. The second time, id=1 already exists so 1 row is updated, the other 9 are added. And so on... I then perform a second round of tests with the same numbers, the only difference being that the destination table is now full (therefore performing only updates).
What I've found out so far is that SSIS is faster only when inserting a certain number of rows (approx . 1000 to 100000). SSIS performed on avg. 30% better with 50000 rows while at 1 million rows, SQL performed better by 5%.
When the data was updated (second round), SQL was always faster.
Of course, these figures are based on my tests which may not reflect actual usage. I run SQL/SSIS in parallel, the tests are done in batch, .... I thought about using DBCC DROPCLEANBUFFER but it doesn't seem to change much.
Maybe my way of comparing is totally flawed... Have you ever tried comparing the two methods in terms of performance? How did you conclude the SSIS way is much faster??
Leonce
Leonce,
First, when using fast load option on the destination, there are two parameters which need to be adjusted based on the rows you are working with. Max Insert Commit Size and Rows Per Batch. What were the values for these set at?
Also note, that when possible, using a SQL Server Destination over the OLE DB Destination would be preferred.
SSIS should be no different than using BULK INSERT (it seems to be just a glorified wrapper to that). Performing updates via an OLE DB Command will always be slower than a batch update using SQL. (Only because the OLE DB Command is executed for every row in the data flow)
Phil
I left the default values for Commit Size and Rows Per Batch ('' and 0). You suggest I should adjust these values depending on the number of rows transferred? What would typical values be for a table with 100000 rows? Does it dependent on the row size?
I use OLEDB Destination, I'll try with SQL Server Destination.
I forgot to mention I use your method for updates (dump into table then update). Do you think SSIS can be faster than plain SQL for updates? I would tend to think it's impossible.
Leonce
fleo wrote: I left the default values for Commit Size and Rows Per Batch ('' and 0). You suggest I should adjust these values depending on the number of rows transferred? What would typical values be for a table with 100000 rows? Does it dependent on the row size?
MICS should be set to whatever you wish the batches to be committed. So if you want to commit in 10,000 row batches, set MICS to 10,000. Rows Per Batch should be set to 100,000 in this case.
fleo wrote: I forgot to mention I use your method for updates (dump into table then update). Do you think SSIS can be faster than plain SQL for updates? I would tend to think it's impossible. True. Using an Execute SQL task will likely have a slight slowness to it compared to straight SQL, but in the end they do the same thing, so they should be REAL close.
Let us say that a package uses merge join option to insert new rows and it uses OLE DB Destination with fast load, table lock, and batch size (1000) options. The table lock option on the destination forces us to put a NOLOCK hint in the OLE DB Source query (on the destination table). Otherwise, they run into locking contentions (one is requesting exclusive table lock when the other is requesting shared table lock). This works most of the times but occationally I see that SQL Server does not honour the hint (one scenario: when the target table does not have a clustered index on the join key column) and runs into locking contentions. Have you seen this? Do you rely on NOLOCK hint too? Thanks for your help in advance.
bkallich wrote: Let us say that a package uses merge join option to insert new rows and it uses OLE DB Destination with fast load, table lock, and batch size (1000) options. The table lock option on the destination forces us to put a NOLOCK hint in the OLE DB Source query (on the destination table). Otherwise, they run into locking contentions (one is requesting exclusive table lock when the other is requesting shared table lock). This works most of the times but occationally I see that SQL Server does not honour the hint (one scenario: when the target table does not have a clustered index on the join key column) and runs into locking contentions. Have you seen this? Do you rely on NOLOCK hint too? Thanks for your help in advance.
I don't use costly merge joins if I can avoid it because the data has to go through a sort upfront. Using a lookup component will cache the records first, before data has a chance to hit the destination table. With that said, I haven't used a merge join in my situations for checking to see if a record exists because the amount of rows hasn't been high enough for me to warrant avoiding the use of a lookup.- Thanks Phil. I will try the lookup transformation instead. I asked the same question on your link (http://blogs.conchango.com/jamiethomson/archive/2006/09/12/SSIS_3A00_-Checking-if-a-row-exists-and-if-it-does_2C00_-has-it-changed.aspx) too and Jamie had the comment to the same effect. Sorry about asking the same question twice.
- Hi everybody,
Any one tell me how to use "the batch update" as the solution above.
I have to update over 20.000 records by using cursor in a proc. It runs vey slowly.
Thanks hoo....that a great article. really helps alot.
but however, i want to know can it be done between two different database? like between MySQL and SQL Server Mobile Edition?? well, you see...i have a system that uses MySQL in desktop and ive install the sql server mobile edition into my pocket pc. the question is ..can only selected data be transfer between those two? if yes, how? by the way, the system in desktop that uses the MySQL is alrdy a complete system and i hv been asked to build its mobile version, which is a TOTALLY new exprience to me. any idea???
thank you in advance. it would be very much appreciated. save my world.
- Hi all,
I too tried using the same way. But i have strucked when i used oracle oledb provider for oledb command. i have got the following error.
Error at Data flow task[oledb command[8717]]: An oledb error has occured. Error Code:0x80040E51. An Ole db record is available. Sourece:"OraOledb" Descrition:"Provider cannot derive parameter information and SetParameterInfo has not been called"
Unable to retreive destination column descriptions from the parameters of the sql command
Please help me in this regarding.
Regards,
Roopa hi Phil,
I am new this technology,thanks for sharing a good knowldge.
Its a great pleasure to get good knowledge from this forum.
maruthi
Well, i usually use 2 SQL Statements for this situation...
Till now, all other methods rather than this seem to affect performance. I know that probably my maintenance is not that user friendly but i know its efficient and old enough to be considered safe enough

Maybe in the future, MS will release a special task for this... Who knows? Or maybe we can start an opensource one... why not?

Best Regards,
Luis Simões wrote: Well, i usually use 2 SQL Statements for this situation...
Till now, all other methods rather than this seem to affect performance. I know that probably my maintenance is not that user friendly but i know its efficient and old enough to be considered safe enough
Maybe in the future, MS will release a special task for this... Who knows? Or maybe we can start an opensource one... why not?
Best Regards,
Microsoft has implemented a MERGE T-SQL function in SQL Server 2008... That would also work.
See John Welch's post here about it: http://agilebi.com/cs/blogs/jwelch/archive/2007/06/16/using-the-merge-statement-in-sql-server-2008.aspx- Hi Phil,
I just wondering, if we have millions records of data, is it possible to use this method since it used to read each record ? I mean, I usually use Inner Join, Right Join and Left Join to find the deleted record, inserted record or updated record. How about the performance of this method comparing the Join method ?
Best regards,
Hery - JOINs are usually faster, assuming appropriate indexes exist. However, using lookups is extremely useful in situations where the source data is not in a relational table, or the source table and the reference table are on two different servers. Also, lookups cache data, so millions of rows aren't necessarily a problem.
Phil Brammer wrote:
Microsoft has implemented a MERGE T-SQL function in SQL Server 2008... That would also work.
See John Welch's post here about it: http://agilebi.com/cs/blogs/jwelch/archive/2007/06/16/using-the-merge-statement-in-sql-server-2008.aspxIt appears that the domain name for this URL has been, how do you verb this, "squatted upon".
John Saunders wrote: Phil Brammer wrote:
Microsoft has implemented a MERGE T-SQL function in SQL Server 2008... That would also work.
See John Welch's post here about it: http://agilebi.com/cs/blogs/jwelch/archive/2007/06/16/using-the-merge-statement-in-sql-server-2008.aspxIt appears that the domain name for this URL has been, how do you verb this, "squatted upon".
Why do you say this? The link works fine for me. The DNS servers for the domain both respond, and return the correct IP address for the domain (along with the WWW record)
The site was down for a few hours (John Saunders alerted me to it - thanks again for that, John). I have the site hosted with one provider and the name registered with another one. For some reason, they choose not to make me aware that domain registration was expiring. It was fixed within about 3 hours of it going down, and should be good from this point forward.
Hi Phil, first of all, great post and many thanks for sharing this article.
I like a lot the first solution, which I addopted.
I just have a question to make. Since we do this lookup to see if it is an update or insert, in the data-flow, I would like to proceed to update and insert tasks. For insert, I use OLE DB DESTINATION inside of the data-flow (when we get the "error" from lookup). Till here, no problem.
But for update task, what you suggest? I think that I should take my key field, and update by this reference, right? but where? Can this be done at the end of this data-flow task, where there are the UNION ALL tasks? Or can this only be done outside data-flow, in the Control Flow, with SQL Task? If so, I should take in consideration, if data-flow ended with error or correctly, to know if it was an insert, or can I do the update respectively. Am I right, or I missed something here?
Thanks.
- You can use an OLE DB Command to issue Updates inside the data flow. That can slow performance, though, since the updates are issued row by row. A better performing technique is to send the rows to be updated to a second OLE DB Destination so that they are saved in a working table, and then use an Execute SQL after the data flow to perform a batch update.
- Hi all.Thanks jwelch for the reply. Well, I'm trying to make the update inside the data-flow, using OLE DB DESTINATION. Here is the link: http://i85.photobucket.com/albums/k61/marcoadf/dts.jpgAs we can see, I have the lookup before. The insert is working fine with the "error" from lookup. I just have a OLE DB DESTINATION to the table where I wanna insert and mapp the columns.My problem is how can I make the update here. How can I make my sql to update (column x=1) where the line to be updated is equal to the column returned from lookup (y=?). That's the only way I can make the update right? Or the solution is something like:UPDATE TSET X=?FROM TABLE TWHERE Y=?and the ? are ordered as the resultset coming from lookup, am I right? This means that first column must be the updatable column and the second ? must be the column_cod that is the key to update right?Thanks in advance.
You can't do this. You will have to use an OLE DB Command Transform component instead.
marcoadf wrote: Hi all.Thanks jwelch for the reply. Well, I'm trying to make the update inside the data-flow, using OLE DB DESTINATION. Here is the link: http://i85.photobucket.com/albums/k61/marcoadf/dts.jpgAs we can see, I have the lookup before. The insert is working fine with the "error" from lookup. I just have a OLE DB DESTINATION to the table where I wanna insert and mapp the columns.My problem is how can I make the update here. How can I make my sql to update (column x=1) where the line to be updated is equal to the column returned from lookup (y=?). That's the only way I can make the update right? Or the solution is something like:UPDATE TSET X=?FROM TABLE TWHERE Y=?and the ? are ordered as the resultset coming from lookup, am I right? This means that first column must be the updatable column and the second ? must be the column_cod that is the key to update right?Thanks in advance.In the second OLE DB Destination, you need to write the rows to a second, working table. Then you'd issue a batch UPDATE in an Execute SQL.
UPDATE table
SET colA = tableB.colA
FROM table, tableB
WHERE table.ID = tableB.ID
Hi, all experts here,
I dont really understand why there is so many complex methods of checking a row to see if it exists or not.
Why dont we just use a Slowly Changing Dimension (SCD) component to sort it all out?
With SCD, by setting up the business key for the table where the records are in, we could easily check out if a record exists or not, and then if not, then the component will insert the new record for you.
If it exists, then we can update it with options to keep the historical records or not.
Is it a better way of dealing with it?
Looking forward to hearing from you for more advices.
Thanks.
Best regards,
| Helen999888 wrote: | |
|
Helen,
In my opinion these methods are a lot simpler than the SCD component which hides all its internal machinations underneath a wizard. It also tried to do a lot more than what we need it to do here. Just my opinion.
The main problem with the SCD component is that it issues a SQL query for every row in the dataflow, it has no ability to cache data like the Lookup component does.
-Jamie
| Jamie Thomson wrote: | ||||
|
In line with Jamie's comment, one of the performance recommendations from Microsoft is to use a Lookup before the SCD transform to make sure it has less rows to process.
My opinion - if all you are doing is seeing if a row exists or not, the SCD transform is overkill. If you actually want to check each attribute for changes (to handle mixed Type 1 and Type 2 columns) the SCD transform might be worth it. Even in cases where I have used it, the first thing I do is delete the OLE DB Command the wizard creates and send the rows to a temp table so I can do a batch update.
Hello Phil, What you planted a year ago has is now full grown and helps someone like me who is new to enjoy the fruit. Thanks for all those who contibuted, making it easier for newbies like me.
Phil, Jamie and to all -
Thank you guys for sahring all your very wealthy knowledge and experiences regarding SSIS. All of you are the best! This forum became my bible since i started using SSIS. It indeed make a lot of difference...
Keep up the good work!
Concon
| jwelch wrote: | |
|
Funny you should mention this. I was just debugging a package that does a lookup against a table with millions of rows. Watching the output window, I saw that the lookup was caching data for several minutes. Is this going to happen more quickly in subsequent runs of the same package, even though the table being referenced by the lookup is ultimately the table that will have new rows inserted into it by this same package?
Or is there a more definitive solution for testing and inserting only new rows into a table that contains millions of rows using SSIS? My first thought was to use variables to limit the scope of the lookup, but it looks like variables aren't available in the lookup component. That would be a nice improvement to future versions.
In 2005, the row caching will take the same amount of time each execution. In 2008, you can store the cache file between executions of the package.
| jwelch wrote: | |
|
I'm performing this insert on a table that contains over 150 million rows now, and will continue to grow at a fairly fast pace. I let my SSIS package run over night, and after caching just over 26 million rows, it errored out with a mysterious "OLEDB Error while populating internal cache".
I'm proceeding with the assumption that a lookup isn't a good idea on a table this size, and have two other ideas to try next:
1. Prefilter my datasource by doing a JOIN to the Destination table, and weeding out the duplicates in the source table that way.
2. Let the duplicates pass to the Destination, and ignore the error when the PK Constraint doesn't allow them to be inserted.
Any advice or warnings would be happily appreciated!
The link http://blogs.conchango.com/jamiethomson/archive/2006/09/12/SSIS_3A00_-Checking-if-a-row-exists-and-if-it-does_2C00_-has-it-changed.aspx is unavailable or moved. Does anyone know where the information of this link is? I need this...
Thanks for help
Perhaps the site was just down for a little while. (Conchango changed hands to EMC - note the new URL - but I was redirected successfully.)

As such, I don't know that I would trust that link to be valid for too much longer; although I have absolutely no basis for this statement. I just have no clue what the norm would be for the average companies policies in regards to maintaining previous employees blog contents.
Please mark answered posts. Thanks for your time.
Doing Insert and Update for millions of rows can be very time consuming and tedious job.
If some one want to do high peformance Upsert check component offered by Pragmatic works .. its called UPSERT component.
http://www.pragmaticworks.com/products/business-intelligence/taskfactory
You will be amazed how easy it is to setup Upsert and Performance is several times faster than traditional technique.
I hope this will save some pain for you :)
Cheers!!
Doing Insert and Update for millions of rows can be very time consuming and tedious job.
If some one want to do high peformance Upsert check component offered by Pragmatic works .. its called UPSERT component.
http://www.pragmaticworks.com/products/business-intelligence/taskfactory
You will be amazed how easy it is to setup Upsert and Performance is several times faster than traditional technique.
I hope this will save some pain for you :)
Cheers!!
apparently, this would be advantageous for those using ssis 2005 or a destination other than sql server...but i'm not sure if performs better than the t-sql MERGE statement in sql server 2008. MERGE can perform batch upsert in a single DML statement: http://www.ssisbi.com/facilitating-a-delta-refresh-with-ssis-using-the-t-sql-merge-statement/
Duane Douglas, MCAD, MCITP | http://www.ssisbi.com | Please mark the post(s) that answered your question.
Hi, I had a question about the first article. (SSIS Junkie) I'm really new to SSIS, so please bare with me. I've gotten the package to work fine, and it does insert new rows, but it seems to me that the article has left out how to update rows if it finds an age that has changed? Can anyone tell me how I can update the changed ages after the union all task All Terminate Flow 2? Or if I am misunderstanding the article, can you tell me where the package updates rows? Thanks.
for some odd reason, i'm unable to pull up jaime's blog in my browser.
if you make a new thread, (perhaps) someone else can assist you.
Duane Douglas, MCAD, MCITP | http://www.ssisbi.com | Please mark the post(s) that answered your question.



