Removing Duplicates Using Sort Transform

  • Question

  • There are other threads on this subject but I really haven't seen a definitive answer.

    If my data is sorted in a particular order (in a previous Sort Transform) and I use another Sort Transform to remove duplicates, will the last row for each duplicate (or multiple) be kept?
    Tuesday, July 21, 2009 7:31 AM

Answers

All replies

  • Yes, it will be.
    The best way is to try it out with some dummy data to be sure on your part.
    Hope this helps! - Sudeep | Please mark the post(s) as “Answered” that answer your query.
    Tuesday, July 21, 2009 7:32 AM
  • There are other threads on this subject but I really haven't seen a definitive answer.

    If my data is sorted in a particular order (in a previous Sort Transform) and I use another Sort Transform to remove duplicates, will the last row for each duplicate (or multiple) be kept?
    I believe it keeps the first one that it encounters - but don't quote me on that. You should treat it as arbitrary.

    Why are you using two Sort transforms by the way? One should be enough.

    http://jamiethomson.spaces.live.com/ | http://blogs.conchango.com/jamiethomson | @jamiet
    Tuesday, July 21, 2009 7:54 AM
  • I am using the row number to sort first. If I did it all in one Sort Transform, there wouldn't be any duplicates. Mind you, I probably don't need to sort by row number, assuming that the original loading order is maintained throughout the pipeline.

    The real question is: which row is kept? It does seem a bit odd that there is no way to specify. I can experiment, obviously, but I will never be sure what the intended behaviour is.

    An option that I didn't think I could use is to sort this out in a SQL query on the data source using GROUP BY and a "Last" aggregate function that I thought I remembered using somewhere. Searching around, it appears that this is available in MS Access but not T-SQL (surprisingly).

    There is apparently a custom component example to do this, so perhaps there is no easy answer, which is surprising.

    Tuesday, July 21, 2009 12:50 PM
  • I am using the row number to sort first. If I did it all in one Sort Transform, there wouldn't be any duplicates. Mind you, I probably don't need to sort by row number, assuming that the original loading order is maintained throughout the pipeline.


    Never assume that! It must be explicitly stated and controlled (probably by using an ORDER BY clause in your original statement).

    The real question is: which row is kept? It does seem a bit odd that there is no way to specify. I can experiment, obviously, but I will never be sure what the intended behaviour is.

    An option that I didn't think I could use is to sort this out in a SQL query on the data source using GROUP BY and a "Last" aggregate function that I thought I remembered using somewhere. Searching around, it appears that this is available in MS Access but not T-SQL (surprisingly).

    That doesn't surprise me at all. A "Last" function implies that the data exists in a known order, and of course one should never assume the order of a dataset when using a "proper" DBMS (I don't refer to Access as a "proper" DBMS). The choice of which row you get out of the deduping Sort transform is arbitrary, exactly as it would be in a relational DBMS (and exactly as it should be, in my opinion).

    Incidentally, even given what I just said, I do think there should be a LAST function in T-SQL, but only if used in conjunction with the OVER clause.

    If you want to ensure that you get the "first" row (whichever that may be) then you have to order the data in some way which is exactly what you are doing.


    There is apparently a custom component example to do this, so perhaps there is no easy answer, which is surprising.

    Regards
    Jamie
    http://jamiethomson.spaces.live.com/ | http://blogs.conchango.com/jamiethomson | @jamiet
    Tuesday, July 21, 2009 1:01 PM
  • Thanks, I didn't think that I could use ORDER BY in the source query but now I think that I can.

    The reason was that I was trying to keep the source query generic (using SELECT * etc) so that I could check that the required columns (in Excel files) exist before proceeding.

    I think now that the generic query might be better done at the Control Flow level (in a For Each File Enumerator). If the required columns don't exist, I can skip the file altogether. This means that when I drop through to the Data Flow level I will know that the anticipated columns exist and can safely use WHERE and ORDER BY (a rough sketch of such a column check is below).
    Tuesday, July 21, 2009 1:17 PM
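    A minimal sketch of what that column check might look like inside a Script Task in the For Each loop (C#), assuming the ACE OLE DB provider is installed; the sheet name and required column names below are purely illustrative:

    // Script Task helper: open the current Excel file and check that the
    // required columns exist before the package drops into the Data Flow.
    // The sheet name ("Sheet1$") and column names are illustrative only.
    using System;
    using System.Data;
    using System.Data.OleDb;
    using System.Linq;

    public static class ExcelColumnCheck
    {
        public static bool HasRequiredColumns(string filePath)
        {
            string[] required = { "CustomerId", "OrderDate", "Amount" };  // hypothetical
            string connStr = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=" + filePath
                           + ";Extended Properties=\"Excel 12.0;HDR=YES\"";

            using (var conn = new OleDbConnection(connStr))
            {
                conn.Open();
                // Ask the provider for column metadata on the target sheet.
                DataTable cols = conn.GetOleDbSchemaTable(
                    OleDbSchemaGuid.Columns,
                    new object[] { null, null, "Sheet1$", null });
                if (cols == null) return false;

                var found = cols.Rows.Cast<DataRow>()
                                .Select(r => r["COLUMN_NAME"].ToString());
                return required.All(c => found.Contains(c, StringComparer.OrdinalIgnoreCase));
            }
        }
    }

    Inside the Script Task you would call this with the file name supplied by the For Each enumerator's variable and write the result to a Boolean package variable that a precedence constraint (or an expression on the Data Flow task) can test.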
  • I haven't tested this myself, but it's somewhere in a conversation I had with another fellow who was asking basically the same question.  He did some experimenting and found that although it usually appeared as though the Sort component took the "last" copy of a particular "duplicate", that was not reliably the case - for rows where duplicates spanned buffers, something different happened, and the last row was NOT the one passed through.

    Although I can't recall the specifics, the lesson is clear (as previously stated by Jamie): you can NOT rely on the Sort transformation picking the passed-through values consistently.  The SSIS team (knowingly or unknowingly) omitted a commitment to which row's contents will be passed through.  This gives them the freedom to completely alter that algorithm any time they want to, for whatever reason.  Murphy's Law will guarantee they'll do so if you count on that behaviour.

    Picking a specific row is as easy as Multicasting your flow, sending one output to an Aggregate to pick the MAX (or MIN) row number for each "key", then Merge Joining back together based on the row number.  This does require use of several blocking components - but you're already using some, so tough... :)
    Todd McDermid's Blog
    Tuesday, July 21, 2009 5:18 PM
  • Although I can't recall the specifics, the lesson is clear (as previously stated by Jamie): you can NOT rely on the Sort transformation picking the passed-through values consistently.  The SSIS team (knowingly or unknowingly) omitted a commitment to which row's contents will be passed through.  This gives them the freedom to completely alter that algorithm any time they want to, for whatever reason.  Murphy's Law will guarantee they'll do so if you count on that behaviour.

    Todd McDermid's Blog

    I agree. You can do all the testing you want but you will never be sure about the result unless there is an official statement on what to expect.

    I am going to see what I can do with the SQL Command on the Excel Source. It's not as simple as using Min/Max because the decision is more about the row order. I haven't used Rank and/or Partition before, and I don't know whether they are even available in an Excel Source command, but there seem to be suggestions that these might be worth looking at.

    The requirement is that, if rows with the same key are found in an Excel file, the last row should be kept.

    It would be useful if the Sort Transform allowed you to specify whether the first or last row should be kept when removing duplicates. It would get a little complicated because you would also need to specify a secondary sort order for evaluating first and last.

    Let's face it, Excel and text files present some real challenges as a method of data input.
    Tuesday, July 21, 2009 8:10 PM
  • They do.

    You can use a Script component to add row numbers, or use the Row Number transform to accomplish that as well (in order to get the Min/Max within the Data Flow method I described to work for you). A minimal sketch of such a row-numbering script is below.
    Todd McDermid's Blog
    Tuesday, July 21, 2009 8:47 PM
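    A minimal sketch of that kind of row-numbering Script Component (C#, SSIS 2008), assuming an Int32 output column named RowNumber has been added on the component's Inputs and Outputs page; the column name is only illustrative:

    // C# Script Component (synchronous transformation) that stamps each row
    // with an incrementing number, preserving the order in which rows arrive.
    // Only the field and the overridden method below need to be added to the
    // auto-generated ScriptMain class; "RowNumber" is an Int32 output column
    // defined on the Inputs and Outputs page.
    public class ScriptMain : UserComponent
    {
        private int _rowNumber = 0;

        public override void Input0_ProcessInputRow(Input0Buffer Row)
        {
            _rowNumber++;                    // next sequence value
            Row.RowNumber = _rowNumber;      // write it to the output column
        }
    }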
  • You can use a Script component to add row numbers, or use the Row Number transform to accomplish that as well (in order to get the Min/Max within the Data Flow method I described to work for you).
    Todd McDermid's Blog

    I took your advice and used a Script transform to generate row numbers, immediately after the file was loaded.

    I then use a Sort Transform to sort by the key fields (ascending) and row number (descending). Next I use another simple Script transform to add a new column called "Duplicate", which is set to true or false depending on whether the previous row had the same key or not (sketched below). Lastly, I use a Conditional Split to redirect (and log) the duplicates.

    This gives me control over which of the duplicate rows are kept.
    Wednesday, July 22, 2009 10:11 AM
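    A minimal sketch of that duplicate-flagging script (C#), assuming a single string key column named KeyColumn and a Boolean output column named Duplicate; both names are illustrative. Because the flow is sorted by key ascending and row number descending, the first row seen for each key is the last row loaded from the file, and every later row with the same key gets flagged:

    // C# Script Component placed after the Sort. Compares each row's key with
    // the previous row's key and sets the Duplicate flag accordingly; only the
    // fields and the overridden method need to be added to ScriptMain.
    public class ScriptMain : UserComponent
    {
        private bool _firstRow = true;
        private string _previousKey;

        public override void Input0_ProcessInputRow(Input0Buffer Row)
        {
            string currentKey = Row.KeyColumn_IsNull ? string.Empty : Row.KeyColumn;

            // A row is a duplicate if it has the same key as the row before it.
            Row.Duplicate = !_firstRow && currentKey == _previousKey;

            _previousKey = currentKey;
            _firstRow = false;
        }
    }

    In the Conditional Split, a condition such as Duplicate == TRUE then diverts the flagged rows to the logging output.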
  • I forgot to point you to a post I'd written that has a bunch of options laid out - including (I think) everything we'd talked about here!  Eliminating Duplicate Primary Keys in SSIS.
    Todd McDermid's Blog
    • Marked as answer by Tony Tang_YJ Friday, July 31, 2009 9:11 AM
    Wednesday, July 22, 2009 3:31 PM