Merge Join vs. Lookup vs. Custom Script - which is fastest?

    Question

  • Very often we have 50,000+ rows for which we need to pull values from a different source (e.g. CityNames from citycode in an Excel file). Currently we are using Lookup - but the question is: which of those 3 options is best performance-wise?
    Friday, October 20, 2006 2:53 PM

Answers

  • So here we go:
    I'm running the tests on my workstation: WinXP, 2.93 GHz, 2.5 GB RAM.
    The DB is accessed over the LAN.

    Test 1 (Lookup):
    Source: multi-flat-file source (4 .txt files) with a total of 175513 records and 88 columns
    The Lookup is a query accessing a table of 248250 records, pulling 61280 records and about 25 columns
    2 outputs - Listing Found (56523 rows) and Error Listing Not found (118990 rows)
    The Lookup is also in Full Cache mode and gives the warning: found duplicate key values.
    Result:
    Finished, 4:11:00 PM, Elapsed time: 00:00:15.437
    Note: Memory usage of PC peaked at 1.8GB with CPU usage jumping to 100% once.

    Test 2 (Merge Join):
    1st source: multi-flat-file source (4 .txt files) with a total of 175513 records and 88 columns
    2nd source: OLE DB Source with a query accessing a table of 248250 records, pulling 61280 records and about 25 columns, with ORDER BY ID. The output is marked as sorted by the ID column.
    The 1st source is sorted using a "Sort" transform.
    Then "Merge Joined" with the OLE DB source via a left outer join (Sort output on the left)
    Then "Conditional Split" based on ISNULL(oledbsource.ID)
    Result:
    Finished, 4:49:33 PM, Elapsed time: 00:01:14.235
    Note: Memory usage of PC peaked at 2.6GB with CPU usage jumping to 100% twice.

    Test 3 (Script Transform):
    Source: multi-flat-file source (4 .txt files) with a total of 175513 records and 88 columns
    A Script transform does a lookup against the database, based on the key column, for each row in the pipeline.
    Result:
    Cancelled after 30 minutes of processing, during which it had processed 11547 records (out of 175513)
    Note: Memory usage was stable around 1GB and CPU near 5% usage

    My Conclusion:
    Although I was concerned about the performance of the Lookup transform - used here to test whether data is to be inserted or updated - it seems that's not the culprit. The root of the problem seems to be the OLE DB Command doing updates and the OLE DB Destination (at the moment we are using a SQL 2000 DB - upgrading to 2005 soon).
    Although the Script transform consumed the least machine resources, executing 100K+ SQL queries against the DB will take too long.
    Although the Merge Join elapsed time is not bad, the resource usage and the 3 extra steps compared to the Lookup are negatives.
    So I think next weekend's performance testing will be about how to make INSERTs/UPDATEs to the DB faster.

    Tests 1 & 2 are based on Jamie Thomson's article - http://blogs.conchango.com/jamiethomson/archive/2006/09/12/SSIS_3A00_-Checking-if-a-row-exists-and-if-it-does_2C00_-has-it-changed.aspx
    Test 3 is based on Greg Van Mullem's article - http://www.mathgv.com/sql2005docs/SSISTransformScriptETL.htm
    Sunday, October 22, 2006 10:11 PM
  • Have you run your own tests?

    Take your source and throw it through each of the above options and finally into a row counter.  Compare the time it takes to get through the whole dataflow.
    Friday, October 20, 2006 3:05 PM
  •  TheViewMaster wrote:
    Thanks guys for your answers - I will try out some performance testing this weekend in my free time.
    So far it has seemed to me that Merge Join is slower than Lookup; however, the Lookup seems to take much longer than I'd like it to - so I was wondering if creating a script transform would be a better solution... Just wanted to get an idea - based on your experience, which option do you use?

    OK. Well, I am loath to give my opinions on performance comparisons, but I'd lay a lot of money on the script transform being the slowest.

     

    -Jamie

     

    Friday, October 20, 2006 3:41 PM
  • The script component cannot, natively, do "caching".

    The script processes row by row, and what your code did was, for each row, fire a SELECT at SQL.

    A factor in this performance would be: what does your query plan look like? Does it use an index?

    A closer comparison would be to load an array of all the records from your table in PreExecute.

    This will act more like the Lookup where, assuming you are fetching everything, indexing is not as important as it is when running thousands of SELECTs.

    Once you have your "cache", you can look up each row from that rather than from SQL.

    Likewise, disabling caching on the Lookup is the same as the test you did: for each row, go and get it from SQL. This will never be a fast operation when you are talking about thousands of lookups...

     

    As Jamie said, I too would put money on it being hard to make the script component as quick as a fully cached Lookup. The Lookup is designed for one thing only.

     

    Also, your ideas on getting SQL to sort the data are right. If you can do it there, do it...

     

     

    Wednesday, October 25, 2006 2:10 PM

All replies

  • Have you run your own tests?

    Take your source and throw it through each of the above options and finally into a row counter.  Compare the time it takes to get through the whole dataflow.
    Friday, October 20, 2006 3:05 PM
  •  TheViewMaster wrote:
    Very often we have 50,000+ rows for which we need to pull values from a different source (e.g. CityNames from citycode in an Excel file). Currently we are using Lookup - but the question is: which of those 3 options is best performance-wise?

     

    Only you can answer that question. Test and measure, test and measure, test and measure.

     

    -Jamie

     

    Friday, October 20, 2006 3:10 PM
  • Thanks guys for your answers - I will try out some performance testing this weekend in my free time.
    So far it has seemed to me that Merge Join is slower than Lookup; however, the Lookup seems to take much longer than I'd like it to - so I was wondering if creating a script transform would be a better solution... Just wanted to get an idea - based on your experience, which option do you use?
    Friday, October 20, 2006 3:31 PM
  •  TheViewMaster wrote:
    Thanks guys for your answers - I will try out some performance testing this weekend in my free time.
    So far it has seemed to me that Merge Join is slower than Lookup; however, the Lookup seems to take much longer than I'd like it to - so I was wondering if creating a script transform would be a better solution... Just wanted to get an idea - based on your experience, which option do you use?

    OK. Well, I am loath to give my opinions on performance comparisons, but I'd lay a lot of money on the script transform being the slowest.

     

    -Jamie

     

    Friday, October 20, 2006 3:41 PM
  • If you do test the three methods, please post the results here. I am using a custom script for lookups (small reference lists but millions of source rows in the pipeline), but I would like to know how large reference lists perform.

     

    Friday, October 20, 2006 8:23 PM
  • For those posting to this thread and reading it, please watch the Webcast presented by Donald Farmer on performance and scale in SSIS.  In there Donald talks about benchmarking and how to set up SSIS to obtain timings associated with different aspects of a package.

    TechNet Webcast: SQL Server 2005 Integration Services: Performance and Scale (Level 400)
    MS TechNet Event ID: 1032298087

    I don't know if this link will work for anyone:

    https://msevents.microsoft.com/CUI/Register.aspx?culture=en-US&EventID=1032298087&CountryCode=US&IsRedirect=false

    Friday, October 20, 2006 8:35 PM
  • Where can I report a BUG in this forum? I have a 50/50 chance that when I try to create a hyperlink in my post, Firefox crashes.
    (Thank god I copied and pasted the following post to Notepad before "doing the hyperlink trick".)
    Sunday, October 22, 2006 9:51 PM
  • So here we go:
    I'm running the tests on my workstation: WinXP, 2.93 GHz, 2.5 GB RAM.
    The DB is accessed over the LAN.

    Test 1 (Lookup):
    Source: multi-flat-file source (4 .txt files) with a total of 175513 records and 88 columns
    The Lookup is a query accessing a table of 248250 records, pulling 61280 records and about 25 columns
    2 outputs - Listing Found (56523 rows) and Error Listing Not found (118990 rows)
    The Lookup is also in Full Cache mode and gives the warning: found duplicate key values.
    Result:
    Finished, 4:11:00 PM, Elapsed time: 00:00:15.437
    Note: Memory usage of PC peaked at 1.8GB with CPU usage jumping to 100% once.

    Test 2 (Merge Join):
    1st source: multi-flat-file source (4 .txt files) with a total of 175513 records and 88 columns
    2nd source: OLE DB Source with a query accessing a table of 248250 records, pulling 61280 records and about 25 columns, with ORDER BY ID. The output is marked as sorted by the ID column.
    The 1st source is sorted using a "Sort" transform.
    Then "Merge Joined" with the OLE DB source via a left outer join (Sort output on the left)
    Then "Conditional Split" based on ISNULL(oledbsource.ID)
    Result:
    Finished, 4:49:33 PM, Elapsed time: 00:01:14.235
    Note: Memory usage of PC peaked at 2.6GB with CPU usage jumping to 100% twice.

    Test 3 (Script Transform):
    Source: multi-flat-file source (4 .txt files) with a total of 175513 records and 88 columns
    A Script transform does a lookup against the database, based on the key column, for each row in the pipeline.
    Result:
    Cancelled after 30 minutes of processing, during which it had processed 11547 records (out of 175513)
    Note: Memory usage was stable around 1GB and CPU near 5% usage

    My Conclusion:
    Although I was concerned about the performance of the Lookup transform - used here to test whether data is to be inserted or updated - it seems that's not the culprit. The root of the problem seems to be the OLE DB Command doing updates and the OLE DB Destination (at the moment we are using a SQL 2000 DB - upgrading to 2005 soon).
    Although the Script transform consumed the least machine resources, executing 100K+ SQL queries against the DB will take too long.
    Although the Merge Join elapsed time is not bad, the resource usage and the 3 extra steps compared to the Lookup are negatives.
    So I think next weekend's performance testing will be about how to make INSERTs/UPDATEs to the DB faster.

    Tests 1 & 2 are based on Jamie Thomson's article - http://blogs.conchango.com/jamiethomson/archive/2006/09/12/SSIS_3A00_-Checking-if-a-row-exists-and-if-it-does_2C00_-has-it-changed.aspx
    Test 3 is based on Greg Van Mullem's article - http://www.mathgv.com/sql2005docs/SSISTransformScriptETL.htm
    Sunday, October 22, 2006 10:11 PM
  •  TheViewMaster wrote:
    So here we go:
    I'm running the tests on my workstation: WinXP, 2.93 GHz, 2.5 GB RAM.
    The DB is accessed over the LAN.

    Test 1 (Lookup):
    Source: multi-flat-file source (4 .txt files) with a total of 175513 records and 88 columns
    The Lookup is a query accessing a table of 248250 records, pulling 61280 records and about 25 columns
    2 outputs - Listing Found (56523 rows) and Error Listing Not found (118990 rows)
    The Lookup is also in Full Cache mode and gives the warning: found duplicate key values.
    Result:
    Finished, 4:11:00 PM, Elapsed time: 00:00:15.437
    Note: Memory usage of PC peaked at 1.8GB with CPU usage jumping to 100% once.

    Test 2 (Merge Join):
    1st source: multi-flat-file source (4 .txt files) with a total of 175513 records and 88 columns
    2nd source: OLE DB Source with a query accessing a table of 248250 records, pulling 61280 records and about 25 columns, with ORDER BY ID. The output is marked as sorted by the ID column.
    The 1st source is sorted using a "Sort" transform.
    Then "Merge Joined" with the OLE DB source via a left outer join (Sort output on the left)
    Then "Conditional Split" based on ISNULL(oledbsource.ID)
    Result:
    Finished, 4:49:33 PM, Elapsed time: 00:01:14.235
    Note: Memory usage of PC peaked at 2.6GB with CPU usage jumping to 100% twice.

    Test 3 (Script Transform):
    Source: multi-flat-file source (4 .txt files) with a total of 175513 records and 88 columns
    A Script transform does a lookup against the database, based on the key column, for each row in the pipeline.
    Result:
    Cancelled after 30 minutes of processing, during which it had processed 11547 records (out of 175513)
    Note: Memory usage was stable around 1GB and CPU near 5% usage

    My Conclusion:
    Although I was concerned about the performance of the Lookup transform - used here to test whether data is to be inserted or updated - it seems that's not the culprit. The root of the problem seems to be the OLE DB Command doing updates and the OLE DB Destination (at the moment we are using a SQL 2000 DB - upgrading to 2005 soon).
    Although the Script transform consumed the least machine resources, executing 100K+ SQL queries against the DB will take too long.
    Although the Merge Join elapsed time is not bad, the resource usage and the 3 extra steps compared to the Lookup are negatives.
    So I think next weekend's performance testing will be about how to make INSERTs/UPDATEs to the DB faster.

    Tests 1 & 2 are based on Jamie Thomson's article - http://blogs.conchango.com/jamiethomson/archive/2006/09/12/SSIS_3A00_-Checking-if-a-row-exists-and-if-it-does_2C00_-has-it-changed.aspx
    Test 3 is based on Greg Van Mullem's article - http://www.mathgv.com/sql2005docs/SSISTransformScriptETL.htm

     

    Excellent stuff. This is really valuable information. Thank you. I've updated my post with a link to here.

    Monday, October 23, 2006 1:58 PM
  • Yes, thanks for posting - very interesting info. Today I am going to change all my script lookups to use the StringBuilder class and its methods (strongly recommended throughout the .NET literature where performance matters when modifying strings). Currently all my lookup script transforms use object-based .NET string variables, which are notoriously terrible performers when the string values are repeatedly modified. Do you know which approach your script transform used? (Assuming you are creating and modifying string variables in your lookup script)...

     

    If I detect the same low processor usage in my script lookups I may also try and partition the pipeline to get a lookup to run with multiple threads...

     

    Ken

    Monday, October 23, 2006 6:32 PM
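    A minimal illustration of the difference Ken is describing (not from the thread; the helper names and the idea of a concatenated lookup key are made up for the example):

    ' Repeated modification of a String allocates a new string on every pass.
    Private Function BuildKeySlow(ByVal parts() As String) As String
        Dim key As String = ""
        For Each part As String In parts
            key &= part & "|"
        Next
        Return key
    End Function

    ' StringBuilder appends into a reusable buffer, which is much cheaper per row.
    Private Function BuildKeyFast(ByVal parts() As String) As String
        Dim sb As New System.Text.StringBuilder()
        For Each part As String In parts
            sb.Append(part).Append("|"c)
        Next
        Return sb.ToString()
    End Function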
  • My script does a lookup similar to the one described in the aforementioned Van Mullem article:

    Imports System.Data
    Imports System.Data.SqlClient

    ' Class-level members; sqlConn is assumed to be created and opened elsewhere
    ' (e.g. in AcquireConnections) - that part was omitted from the post.
    Private sqlConn As SqlConnection
    Private sqlCmd As SqlCommand
    Private sqlParam As SqlParameter

    Public Overrides Sub PreExecute()
        ' Prepare the parameterised lookup query once, before rows start flowing.
        sqlCmd = New SqlCommand("SELECT KeyCustomer, CustomerName FROM tblCustomer WHERE (KeyCustomer = @KeyCustomer)", sqlConn)
        sqlParam = New SqlParameter("@KeyCustomer", SqlDbType.Int)
        sqlCmd.Parameters.Add(sqlParam)
    End Sub

    Public Overrides Sub CustomerRecordsInput_ProcessInputRow(ByVal Row As CustomerRecordsInputBuffer)
        ' One round trip to SQL Server per pipeline row.
        Dim reader As SqlDataReader
        sqlCmd.Parameters("@KeyCustomer").Value = Row.CUNO
        reader = sqlCmd.ExecuteReader()
        If reader.Read() Then
            Row.DirectRowToUpdateRecordsOutput()
        Else
            Row.DirectRowToInsertRecordsOutput()
        End If
        reader.Close()
    End Sub


    Wednesday, October 25, 2006 4:18 AM
  • Ken - is your script performing a lookup against another source in the pipeline?
    <boy, I'd like to know how to do that>

    Also - any suggestions on how to improve the performance of the OLE DB Command doing updates?
    Wednesday, October 25, 2006 4:21 AM
  • Do a fair comparison though. Either change your query to cache the rows from SQL or disable caching on the lookup. Oranges != Apples.

    A non-cached lookup will be extremely slow, as was your script component.

     

     

    Wednesday, October 25, 2006 9:20 AM
  •  Crispin wrote:

    Do a fair comparison though. Either change your query to cache the rows from SQL or disable caching on the lookup. Oranges != Apples.

    A non-cached lookup will be extremely slow, as was your script component.



    It would be best to try to replicate full caching in the script component.  The purpose of the exercise was to see which was faster.  So, we know how fast (and legitimately so) the lookup component was, now how fast can we get the script component to process? 

    The question is how fast can each of the elements process their data, not how slow can we make them work.

    Phil
    Wednesday, October 25, 2006 1:22 PM
  •  Phil Brammer wrote:
    Crispin wrote:

    Do a fair comparison though. Either change your query to cache the rows from SQL or disable caching on the lookup. Oranges != Apples.

    A non-cached lookup will be extremely slow, as was your script component.



    It would be best to try to replicate full caching in the script component. The purpose of the exercise was to see which was faster. So, we know how fast (and legitimately so) the lookup component was, now how fast can we get the script component to process?

    The question is how fast can each of the elements process their data, not how slow can we make them work.

    Phil

    I appreciate your feedback - when I did the tests I was not aware of how to enable caching for the script component (BTW - how would you do it?). I did spend some time optimizing the merge join (by marking the dataset coming from the DB as sorted - Sort transforms, as you know, are costly).

    So my script component was a "straight out of the box" solution as described in the Van Mullem article. So I still stick with my verdict that the Lookup is the fastest, and it is also the easiest to set up. Still, my main performance concern is now further down the pipeline - the OLE DB Command doing updates.
    Wednesday, October 25, 2006 1:41 PM
  • The script component cannot, natively, do "caching".

    The script processes row by row, and what your code did was, for each row, fire a SELECT at SQL.

    A factor in this performance would be: what does your query plan look like? Does it use an index?

    A closer comparison would be to load an array of all the records from your table in PreExecute (see the sketch after this post).

    This will act more like the Lookup where, assuming you are fetching everything, indexing is not as important as it is when running thousands of SELECTs.

    Once you have your "cache", you can look up each row from that rather than from SQL.

    Likewise, disabling caching on the Lookup is the same as the test you did: for each row, go and get it from SQL. This will never be a fast operation when you are talking about thousands of lookups...

     

    As Jamie said, I too would put money on it being hard to make the script component as quick as a fully cached Lookup. The Lookup is designed for one thing only.

     

    Also, your ideas on getting SQL to sort the data are right. If you can do it there, do it...

     

     

    Wednesday, October 25, 2006 2:10 PM
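    To make the cached approach above concrete, here is a minimal sketch (not from the thread) of a script component that loads the reference table into a Dictionary in PreExecute and probes it in memory for each row. It reuses the names from the earlier code post (tblCustomer, KeyCustomer, CustomerRecordsInputBuffer, CUNO) purely for illustration, and again assumes sqlConn is opened elsewhere:

    Imports System.Data
    Imports System.Data.SqlClient
    Imports System.Collections.Generic

    Private sqlConn As SqlConnection   ' assumed to be opened elsewhere, e.g. in AcquireConnections
    Private customerCache As Dictionary(Of Integer, String)

    Public Overrides Sub PreExecute()
        ' Load the whole reference table once, before any rows flow through the pipeline.
        customerCache = New Dictionary(Of Integer, String)()
        Dim cmd As New SqlCommand("SELECT KeyCustomer, CustomerName FROM tblCustomer", sqlConn)
        Dim reader As SqlDataReader = cmd.ExecuteReader()
        While reader.Read()
            ' Last value wins on duplicate keys, mirroring the duplicate-key warning the Lookup gave.
            customerCache(reader.GetInt32(0)) = reader.GetString(1)
        End While
        reader.Close()
    End Sub

    Public Overrides Sub CustomerRecordsInput_ProcessInputRow(ByVal Row As CustomerRecordsInputBuffer)
        ' In-memory probe instead of a round trip to SQL Server per row.
        If customerCache.ContainsKey(Row.CUNO) Then
            Row.DirectRowToUpdateRecordsOutput()
        Else
            Row.DirectRowToInsertRecordsOutput()
        End If
    End Sub

    This behaves much like the Lookup's full-cache mode: the reference rows sit in memory for the life of the data flow, so indexing on the table matters far less than it does when firing one SELECT per row.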
  • Yes. I open a text file in the PreExecute phase of the script, read the contents into a dictionary or collection (depending on how many values I want to return from the lookup), and reference the collection's values as the reference-table part of the lookup process. It seems to work great so far.

     

    Wednesday, October 25, 2006 6:26 PM
  • As far as improving the performance of the OLE DB Command doing updates goes, I would bet you are sending a SQL statement to the back end for each row passing through the component. I would not expect this approach to perform well for a lot of rows. I believe you would get much better performance if you can do your work with set-based SQL somehow (see the sketch after this post).
    Wednesday, October 25, 2006 6:39 PM
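    To make the set-based idea concrete, here is a rough, hypothetical sketch (not from the thread): rather than firing one UPDATE per row through the OLE DB Command, the "update" rows are bulk-loaded into a staging table by an OLE DB Destination, and a single joined UPDATE is then run once (the same statement could just as well live in an Execute SQL Task). The staging table name dbo.CustomerStaging is invented for illustration; tblCustomer and its columns come from the earlier code post.

    Imports System.Data.SqlClient

    Public Sub ApplyStagedUpdates(ByVal connectionString As String)
        ' Hypothetical follow-up step: apply all staged changes in one set-based statement.
        Dim updateSql As String = _
            "UPDATE c SET c.CustomerName = s.CustomerName " & _
            "FROM tblCustomer AS c " & _
            "INNER JOIN dbo.CustomerStaging AS s ON s.KeyCustomer = c.KeyCustomer"

        Using conn As New SqlConnection(connectionString)
            conn.Open()
            Using cmd As New SqlCommand(updateSql, conn)
                cmd.ExecuteNonQuery()   ' one round trip updates every matched row
            End Using
        End Using
    End Sub

    The per-row OLE DB Command pays a network round trip and statement execution for every row; the staged approach touches the server once and lets it optimise the join.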
  • Hmm, if I populate a .NET collection or dictionary in the PreExecute phase of a script with the values I want to use as the "reference" list for my lookups, and then use the ProcessInputRow subroutine to find a match in the reference list for a column value in each row passing through the pipeline, wouldn't that reference list by definition be "cached"? The .NET collection or dictionary is static throughout the processing of the pipeline and is held in memory.

     

    Wednesday, October 25, 2006 7:35 PM
  •  Phil Brammer wrote:
     Crispin wrote:

    Do a fair comparison though. Either change your query to cache the rows from SQL or disable caching on the lookup. Oranges != Apples.

    A non-cached lookup will be extremely slow, as was your script component.



    It would be best to try to replicate full caching in the script component.  The purpose of the exercise was to see which was faster.  So, we know how fast (and legitimately so) the lookup component was, now how fast can we get the script component to process? 

    The question is how fast can each of the elements process their data, not how slow can we make them work.

    Phil

     

    Took the words right out of my mouth!

     

    -Jamie

    Wednesday, September 05, 2007 4:51 PM