none
Problem with Merge Join and condition splt trasnformations in SSIS 2008 RRS feed

  • Question

  •  

    Hey,

    While working SSIS in 2008 we have encountering some weird problems with Merge, Merge Join and Condition Spilt transformations, here are the details...

    Condition Split and Merge Join and Condition Split:

    Please check the below diagram(not really just tried) for the complete details.

    Just FYI data is in sorted order

     

    condition Split -> 420000

         | 18                                                          | 419982

     Merge Join(left)  -- src1(6000)

         |10

     Condition Split

         |9     |1

     

    the merge join transformation is not returning the complete 18 rows to the next transformation that's why we are missing 8 rows.

    But if we adds a sort transformation (getting warning as data is already sorted) before merge then complete 18 rows were passing to the condition split.

    Also tried modifing the source query just to return 18 rows then also it was passing the 18 rows to next trasnformation except with full load.

     

     

     

    Merge:

    We have two sorted data sources and first one has 40000 records and second source have 12000 records and after the Merge transformation we have Condition Split transformation, the condition Split transformation is completed before getting all the records from merge that's why we are missing some  1000 records.

     

     

    Does any one faced this type of problems with Merge and Merge Join in 2008?? If so whats the solution for this??

    Is this a bug in SSIS 2008??

     

    Thanks in advance..

     

    -S hiv.

     

     

    Thursday, April 8, 2010 5:02 PM

All replies

  • It's a little hard to understand what you're describing, but one part that I can extract is that your Merge Join is behaving differently in the case where you use a Sort component versus when you don't.

    The Merge Join requires sorted inputs (you obviously know that), and to do away with the requirement to use a Sort component, you need to manipulate the IsSorted and SortKeyPosition properties on the source you're extracting the data from (you also know that).

    However - you have to be VERY sure that the sorting mechanism that your source system is using is identical to the sort order that SSIS uses.  Why?  Because you're telling SSIS that this is the case, and if it isn't, you'll get odd results - like you are.  It could be collation, case handling, special characters, ...

    You may be experiencing something like this: Does SSIS knows how to sort a flow of data? Well… no  I personally have not observed this, but I trust Alberto.  He knows of what he speaks.


    Todd McDermid's Blog
    Thursday, April 8, 2010 5:20 PM
    Moderator
  • Hi Todd,

    Thanks for the Quick Response.Before Explaining my situation here are the quick notes

    1. The sort columns we are using are numeric and i don't think Collation/case handling will be the cause for the problem

    2. The joined columns were sorted and also issorted and sortkey properties were also properly set on source

    3.  The same logic is working fine if we use default maxbuffer rows for DFT to 50000 with out additional sort trasnformation between conditonal split and merge transformation.

     

    Here's the problem description

    Src1 with 42000 records

              |

    Sort Transformation on numeric colns

              |

    Merge left Join ---- src2  with 1000 records

             | 420000 records after merge left join result

    condition split------cond1 ----------419982 records

             | cond2 18 records

    Merge left join ------Src3 with 6000 records

             | return 10 records here rather than 18 records

     

    Please let me know if you have problem understanding the problem, i will send you the screen shot..

    Just FYI I have tried to attach the screen shot but i didnt find any option here...

     

    Thanks,

    -S hiv.

      

    Thursday, April 8, 2010 6:38 PM
  • Sorting on numbers shouldn't introduce that kind of issue, I agree.

    The issue here is that the Merge Join is failing to join columns because of the sort order.  The only way it would do so is if the sort order was different between what SSIS does with the Sort component, and what your source does with it's algorithm.

    What I'd do is:

    1. Send the flow out of your source into a Row Number component (or script your own) so that you get an ordinal number for each row, according to what the source thinks.  Call the column with the row number "SourceRowNumber".

    2. Sort the flow with a Sort component, and add another Row Number component to add another column - "SSISRowNumber".

    3. Use a Conditional Split component to split out the rows where [SourceRowNumber] != [SSISRowNumber].  That might give you a clue as to what's messed up.


    Todd McDermid's Blog
    Thursday, April 8, 2010 8:21 PM
    Moderator
  • Hey Todd,

     I have tested the same scenario and he process didn't returned any rows to out put becoz they both are same.

     

    I dont think the problem I have is becoz of the sorted order if that is the case the process wont sends full 18 records to next transformation when I increased the DFT default maxbuffer rows to 50000 (please see 3rd point on my previous post).

    at this moment I dont have any idea but it might be some problem with merge join in SSIS 2008...

     

    Please advice...

     

    Thanks,

    -S iva.

     

     

    Thursday, April 8, 2010 10:25 PM
  • There's no problem with the Merge Join.  The only problems I've ever seen associated with it are due to incorrect (but understandable) assumptions as to how it matches rows.

    But it looks like I misinterpreted your real issue.  I think I read that you've checked that the sort order from your source matches the Sort transform's sort order.

    If I can restate your problem to ensure I understand it a little better now.  The flow works as you expect - without an "extra" Sort - if you change the DefaultBufferMaxRows to 50000 instead of the default 10000.  Is that correct?

    When you say that you only get 10 rows out of the Merge Join instead of the expected 18, has the Data Flow completely finished?  What indication are you looking at to see that the Data Flow is complete?


    Todd McDermid's Blog
    Thursday, April 8, 2010 11:13 PM
    Moderator
  • Hey Todd,

     

    Good Morning.

     

    yeah you are right with out extra sort the flow works as expected with the change in default buffer max rows.

     

    The data flow is completely finished, we know this becoz the process notifies when the DFT successfully completes, also I have traced the flow with the Data viewers and I have noticed merge join completion before sending complete 18 rows.

     

    Let me know if you need additional information.

     

    Thanks,

    -S iva.

     

    Friday, April 9, 2010 1:38 PM
  • If you introduce the Sort transform, does it work at both buffer levels?

    Do you know which eight records are not being passed through?  Are they rows where you expect a match with the right side, or not?  When the flow "works", how many rows find a match, and how many do not?

    The last Merge Join is a Left Join, with the 18 record flow being on the "left"?

    Unless those questions trigger something new, I'm at a loss.  I've never seen or heard of the Merge Join behaving differently at different buffer levels.  The only issues I've ever heard of with the Merge Join have always been traced back to sort order, or case- or space-sensitivity.  And in those cases, in Left Join conditions, it's always passed through a complete count of rows from the left side - it's only been the matching with right side rows that's been the difficulty.  If those questions don't open up any new ideas, is there any "company sensitive" data in that package?  Would you be able to send it to me so I could look at it?


    Todd McDermid's Blog
    Friday, April 9, 2010 2:28 PM
    Moderator
  • Hey Todd,

     

    Here's the more information for your questions....

    1. If you introduce the Sort transform, does it work at both buffer levels?

    yeah if introduce sort before merge it works as expected at both buffer levels

    2. Do you know which eight records are not being passed through?  Are they rows where you expect a match with the right side, or not?  When the flow "works", how many rows find a match, and how many do not?

    yeah we know those eight records and those eight records actually passed through the last two buffers( 4 & 4) when use deafultmaxrowbuffer.  Since its a left join we dont care whether we have match with right side or not becoz all the rows from the left hand side has to passed to next trasnformation.

    3. The last Merge Join is a Left Join, with the 18 record flow being on the "left"?

    yeah its left join and all the 18 records are from left only

     

    Merge and Merge Join acts totally differently basing on Buffer sizes and we have notices couple of places....

    If you want we can send the test package which we have created for merge and condition splt test for this problem, please share your email address so that we will send the package..

     

    Thanks,

    -S hiv.

     

    Friday, April 9, 2010 3:20 PM
  • Hey Todd,

     

    Here's the more information for your questions....

    1. If you introduce the Sort transform, does it work at both buffer levels?

    yeah if introduce sort before merge it works as expected at both buffer levels

    2. Do you know which eight records are not being passed through?  Are they rows where you expect a match with the right side, or not?  When the flow "works", how many rows find a match, and how many do not?

    yeah we know those eight records and those eight records actually passed through the last two buffers( 4 & 4) when use deafultmaxrowbuffer.  Since its a left join we dont care whether we have match with right side or not becoz all the rows from the left hand side has to passed to next trasnformation.

    3. The last Merge Join is a Left Join, with the 18 record flow being on the "left"?

    yeah its left join and all the 18 records are from left only

     

    Merge and Merge Join acts totally differently basing on Buffer sizes and we have notices couple of places....

    If you want we can send the test package which we have created for merge and condition splt test for this problem, please share your email address so that we will send the package..

     

    Thanks,

    -S hiv.

     

    Hey,

     

    Any one have same problems with Merge Join ???

     

    -S hiv.

     

    Monday, April 12, 2010 1:34 PM
  •  

    Go to my CodePlex contact page for my email: http://www.codeplex.com/site/users/view/toddmcdermid


    Todd McDermid's Blog
    Monday, April 12, 2010 2:35 PM
    Moderator
  • Hey Todd,

     

    I have sent the test mail through codeplex, send me reply with the email id so that i can send the pkg.

     

    -S hiv.

    Tuesday, April 13, 2010 3:42 PM
  • before using merge join or merge the data must be in sorted format.. so try to sort first and then give that as input to the merge join or merge.. if u want 2 join 2 tables which has primary, foreign key relationship then go for merge join.. n speccify thye kind of join u want.. if u jus want 2 merge 2 input files or some thing go with merge...
    Monday, April 19, 2010 10:05 AM
  • Hi Todd,

    I too got the same senario of the Merge Join - while joining I am missing some of the columns from the left source. I have implemented IsSort option on right source and already sorting data from left source data, finally for testing when I used the external Sort component I got the expected join data.

    I feel the SSIS is not logically catching the order of (ASC or DESC) of the source from right and left components for join the sources.

    plea reply

    thanks

    prav

    Thursday, September 9, 2010 1:25 AM
  • That's nice that you "feel" that way - but feeling that way isn't proof of anything.

    The Merge Join in SSIS works perfectly, as advertised.  Unfortunately, I didn't get KewlShiva's package to take a look at (that I recall).

    I suggest you review your package more thoroughly.  Specifically, that BOTH inputs are actually sorted - by the SAME ENGINE, and are correctly labeled as sorted.  If you have one input sorted by SSIS, and another by SQL, or both by SQL, but from different servers, you could have mismatched collations.  If you've used ORDER BYs, you may have misconfigured the SortKeyPositions.

    Again, I'm assuming you know what you're talking about - because "missing columns" is not what this thread is about.  It's about mismatching rows.


    Todd McDermid's Blog Talk to me now on
    Thursday, September 9, 2010 3:43 PM
    Moderator
  • Hi Todd,

    Great to see your reply as I am big fan of your posts and comments.

    As you said I did all the checkings on sorting columns, nothing is wrong with Merge Join component but SSIS is not having ASC or DESC IsSorted property in advanced properties of components, so we can't get the order of columns as we expect in SSIS where it is not possible to use the order of columns in the query.

    I am not a big fan of Merge Join due to the use of expensive SORT component in buffer, I have changed my business implementation to avoid performance hit.

    however, happy to got reply from you.

    thank you very much

    prav

    Friday, September 17, 2010 2:07 PM
  • If you are talking about telling the engine that a source is sorted, for Asc the sort order is positive, for descending it is negative.  Check the B.O.L. entry for how to sort data for merge and merge join .
    Please mark answered posts. Thanks for your time.
    Friday, September 17, 2010 2:34 PM
    Moderator
  • We actually have a very similar problem, our right side merrily spools it's 4 million records, while left side process it's 6k and join reports itself as finished with 6k.

    So far it's exactly like yours, it looks weird though in debugger, because the left side keeps spooling records to merge component, but it's already green and reporting finished processing.

    To make matter more weird, it's only on servers with sp1, new sp2 applied servers don't seem to have this problem. So yeah, looks like a bug.

    Wednesday, February 23, 2011 3:31 PM
  • Hey Todd,

     

    I have sent the package to your email id, Just saw the notes from the thread today and re sent the package  let me know if you didn't received the package.

     

    FYI...I have changed the complete data flow to over come the issue.

     

    Thanks,

    -Shiv

    Thursday, August 4, 2011 7:36 PM
  • Thats true Dmitriy. :)

     

    -Shiv.

    Thursday, August 4, 2011 7:37 PM
  • Kewlshiva,

        How did you ever resolve your issue? I'm dealing with similar issue but it only occurs if I have a large data set in my data flow(more than 100K rows). Also if I use the sort task rather than sorting the data source, the issue doesn't occur.

    Wednesday, August 31, 2011 8:17 PM
  • Hey Dmitrity ,I am getting the same issue. I'm using SQL Server 2008 R2. Is there a sp2 for this version of sql.
    Wednesday, August 31, 2011 8:23 PM
  • I recognize this issue, I had this in SQL2005 as well!
    The fix is to use a sort component before the merge (instead of using the order by in you query and sort setting in advanced properties).

    However I use both, so order by statement, advanced properties and set the sorted settings + an extra sorted component.

    I have noticed this in a project that ran perfectly for a couple of months and after the table exploaded because the component couldn't match / load records (like you descibed in your situation). After  adding the extra sort, it worked.

    Thursday, September 1, 2011 2:02 PM
  • Bram, THe object is to not use the Sort component to allow the package to perform more efficient. We only have this problem when there is a large amount of rows.

    Friday, September 2, 2011 3:04 PM