none
how does remove duplicates option in sort transformation behaves?

    Question

  • Hi
    I am currently on a performance tuning task and I am in the process of replacing the existing transformation (especially SORT, aggregate since it is blocking) into a Stored Procedures (SP). I am not getting the exact results as that of the SORT transformation (remove duplicates) when using the methods in the SP to remove duplicates such as DISTINCT clause, HAVING COUNT>1 option, ROWNUMBER() using OVER PARTITION BY ORDER BY clauses.

    Could any of one help me how exactly does the SORT transformation with remove duplicate options work in SSIS?

    Please let me know if you need any more information.

    Thanks in Advance

    Mervyn
    Thursday, November 12, 2009 5:57 AM

Answers

  • Hi
    I am currently on a performance tuning task and I am in the process of replacing the existing transformation (especially SORT, aggregate since it is blocking) into a Stored Procedures (SP). I am not getting the exact results as that of the SORT transformation (remove duplicates) when using the methods in the SP to remove duplicates such as DISTINCT clause, HAVING COUNT>1 option, ROWNUMBER() using OVER PARTITION BY ORDER BY clauses.

    Could any of one help me how exactly does the SORT transformation with remove duplicate options work in SSIS?

    Please let me know if you need any more information.

    Thanks in Advance

    Mervyn

    There's an important thing to understand about using the "remove duplicates" feature of the Sort transform - the results are arbitrary. For example, take the following dataset:

    ID  |   Name
    3    |   Mervyn
    2    |   Bharani
    1    |   Nitesh
    2    |   Jamie

    Now, if you use the Sort transform to sort on ID and then remove duplicates then you *might* end up with the following result:

    ID  |   Name
    1    |   Nitesh
    2    |   Bharani
    3    |   Mervyn

    or, you *might* end up with this:

    ID  |   Name
    1    |   Nitesh
    2    |   Jamie
    3    |   Mervyn

    notice how it arbitrarily picks either "Jamie" or "Bharani" because they both have the same ID. I've never seen any requirement, ever, that says "arbitrarily pick a result, I don't care which one I get".


    I guess the point I'm ultimately trying to make is this - are you *sure* that what you want to do is replicate this behaviour? There is no construct in T-SQL that replicates the arbitrary nature of the Sort transfom (and nor should there be), with T-SQL you absolutely have to tell it whether (in the above example) you would want "Jamie" or "Bharani".

    Hope that helps.

    -Jamie
    http://sqlblog.com/blogs/jamie_thomson/ | http://jamiethomson.spaces.live.com/ | @jamiet
    • Marked as answer by Mervyn_1984 Thursday, November 12, 2009 8:49 AM
    Thursday, November 12, 2009 7:06 AM
    Moderator
  •  Seems like we dont have any control on the data that comes out of SORT transformation with remove duplicates.

    Absolutely, 100%, true.

    http://sqlblog.com/blogs/jamie_thomson/ | http://jamiethomson.spaces.live.com/ | @jamiet
    • Marked as answer by Mervyn_1984 Thursday, November 12, 2009 8:49 AM
    Thursday, November 12, 2009 7:10 AM
    Moderator
  • Do you have the algorithim behind the SORT transformation remove duplicates option ?

    No, this is not publicly available.
    http://sqlblog.com/blogs/jamie_thomson/ | http://jamiethomson.spaces.live.com/ | @jamiet
    • Marked as answer by Mervyn_1984 Thursday, November 12, 2009 8:49 AM
    Thursday, November 12, 2009 7:11 AM
    Moderator

All replies

  • Chk whether this is helpful:
    Duplicate rows are rows with the same sort key values. The sort key value is generated based on the string comparison options being used, which means that different literal strings may have the same sort key values. The transformation identifies rows in the input columns that have different values but the same sort key as duplicates.

    Read more in http://msdn.microsoft.com/en-us/library/ms140182.aspx
    Thanks, Bharani M - Please mark the post as answered if it answers your question. - Please vote the post as Helpful if you find the post as helpful.
    Thursday, November 12, 2009 6:12 AM
  • You can use CTE for removing duplicates.
    I have a table with 2 columns id and id1 in a table a1. In order to remove the duplicate records from this table i will write a CTE like:

    with MyCTE
    as
    (select id,id1,row_number() Over( Partition by id,id1 Order by id) as row
    from a1)
    delete from MyCTE
    where row>1

    I am not sure about the functionality of removing duplicates as far as SORT transform is concerned

    Nitesh Rai- Please mark the post as answered if it answers your question
    Thursday, November 12, 2009 6:17 AM
  • Hi Bharani,

    Thanks for the information. However, I am not sorting with any of the string values but all sort keys are integers. When I tried using the SORT transformation to remove duplicates for a small set of values, it always retained the data that had the highest row_number. But when this logic was implemented in my Stored Procedure, it did not help. Seems like we dont have any control on the data that comes out of SORT transformation with remove duplicates.

    Mervyn
    Thursday, November 12, 2009 6:32 AM
  • Hi Nitesh,

    I have tried this option but still i am getting some differences as opposed to the existing SSIS SORT output. Do you have the algorithim behind the SORT transformation remove duplicates option ?

    Thanks
    Mervyn
    Thursday, November 12, 2009 6:33 AM
  • Mervyn,
    Did you try the CTE example suggested by nitesh?

    There you can have the control on how it is sorted and what you need to delete from the duplicate list with the help of
    row_number() Over( Partition by id,id1 Order by id) as row this custom column.
    Thanks, Bharani M - Please mark the post as answered if it answers your question. - Please vote the post as Helpful if you find the post as helpful.
    Thursday, November 12, 2009 6:34 AM
  • Bharani,

    I did try that. Here i am working to match the results which I am getting from the existing SSIS package. solution provided by Nitesh still had some differences.

    Mervyn
    Thursday, November 12, 2009 6:37 AM
  • Did you tried changing the order by clause in the row_Number() to desc???


    Thanks, Bharani M - Please mark the post as answered if it answers your question. - Please vote the post as Helpful if you find the post as helpful.
    Thursday, November 12, 2009 6:59 AM
  • Hi
    I am currently on a performance tuning task and I am in the process of replacing the existing transformation (especially SORT, aggregate since it is blocking) into a Stored Procedures (SP). I am not getting the exact results as that of the SORT transformation (remove duplicates) when using the methods in the SP to remove duplicates such as DISTINCT clause, HAVING COUNT>1 option, ROWNUMBER() using OVER PARTITION BY ORDER BY clauses.

    Could any of one help me how exactly does the SORT transformation with remove duplicate options work in SSIS?

    Please let me know if you need any more information.

    Thanks in Advance

    Mervyn

    There's an important thing to understand about using the "remove duplicates" feature of the Sort transform - the results are arbitrary. For example, take the following dataset:

    ID  |   Name
    3    |   Mervyn
    2    |   Bharani
    1    |   Nitesh
    2    |   Jamie

    Now, if you use the Sort transform to sort on ID and then remove duplicates then you *might* end up with the following result:

    ID  |   Name
    1    |   Nitesh
    2    |   Bharani
    3    |   Mervyn

    or, you *might* end up with this:

    ID  |   Name
    1    |   Nitesh
    2    |   Jamie
    3    |   Mervyn

    notice how it arbitrarily picks either "Jamie" or "Bharani" because they both have the same ID. I've never seen any requirement, ever, that says "arbitrarily pick a result, I don't care which one I get".


    I guess the point I'm ultimately trying to make is this - are you *sure* that what you want to do is replicate this behaviour? There is no construct in T-SQL that replicates the arbitrary nature of the Sort transfom (and nor should there be), with T-SQL you absolutely have to tell it whether (in the above example) you would want "Jamie" or "Bharani".

    Hope that helps.

    -Jamie
    http://sqlblog.com/blogs/jamie_thomson/ | http://jamiethomson.spaces.live.com/ | @jamiet
    • Marked as answer by Mervyn_1984 Thursday, November 12, 2009 8:49 AM
    Thursday, November 12, 2009 7:06 AM
    Moderator
  •  Seems like we dont have any control on the data that comes out of SORT transformation with remove duplicates.

    Absolutely, 100%, true.

    http://sqlblog.com/blogs/jamie_thomson/ | http://jamiethomson.spaces.live.com/ | @jamiet
    • Marked as answer by Mervyn_1984 Thursday, November 12, 2009 8:49 AM
    Thursday, November 12, 2009 7:10 AM
    Moderator
  • Do you have the algorithim behind the SORT transformation remove duplicates option ?

    No, this is not publicly available.
    http://sqlblog.com/blogs/jamie_thomson/ | http://jamiethomson.spaces.live.com/ | @jamiet
    • Marked as answer by Mervyn_1984 Thursday, November 12, 2009 8:49 AM
    Thursday, November 12, 2009 7:11 AM
    Moderator

  • are you *sure* that what you want to do is replicate this behaviour? There is no construct in T-SQL that replicates the arbitrary nature of the Sort transfom (and nor should there be), with T-SQL you absolutely have to tell it whether (in the above example) you would want "Jamie" or "Bharani".

    Hi Jamie,

    Thanks for the information. I was desperately trying to arrive at a T-SQL that works as identical as that of the Sort transformation remove duplicate option for the last two days. Now that you have confirmed that the results are arbitrary, I am going to drop off my efforts. I have no other option as I have to tune the SSIS package in the same way as it has been functioning before.

    Thanks
    Mervyn
    • Marked as answer by Mervyn_1984 Thursday, November 12, 2009 8:49 AM
    • Unmarked as answer by Mervyn_1984 Thursday, November 12, 2009 8:49 AM
    Thursday, November 12, 2009 8:49 AM

  • are you *sure* that what you want to do is replicate this behaviour? There is no construct in T-SQL that replicates the arbitrary nature of the Sort transfom (and nor should there be), with T-SQL you absolutely have to tell it whether (in the above example) you would want "Jamie" or "Bharani".

    Hi Jamie,

    Thanks for the information. I was desperately trying to arrive at a T-SQL that works as identical as that of the Sort transformation remove duplicate option for the last two days. Now that you have confirmed that the results are arbitrary, I am going to drop off my efforts. I have no other option as I have to tune the SSIS package in the same way as it has been functioning before.

    Thanks
    Mervyn
    Hi Mervyn,
    OK, glad to hear that.

    I'm rather intrigued though. It sounds from what you just said that the Sort transform is giving you the "correct" output but how do you consider it to be correct if it is arbitrary (thus, unpredicatable)?

    I'm interested to know.

    thanks
    Jamie

    http://sqlblog.com/blogs/jamie_thomson/ | http://jamiethomson.spaces.live.com/ | @jamiet
    Thursday, November 12, 2009 10:48 AM
    Moderator
  • Hi Jamie,

    Thanks for all your replies. I got a confirmation from the business people so as to pick the latest record (MAX of row_number ordered by time) for further processing. Now we are working at removing all remove duplicates processing from the application.

    Thanks for your help again.

    Cheers
    Mervyn
    Wednesday, November 18, 2009 5:07 PM
  • Great answer Mervyn_1984 I  had the same issue with Sort and the way it was filtering the data i wasnt convinced,but your answer is abslutely making a lot of sense.great work.
    Tuesday, June 17, 2014 1:51 PM