locked
removing one or more (partially) duplicate rows but keeping one.. RRS feed

  • Question

  • Every week I create a xls dump from a database. This database contains orders including several order-process status columns.

    My goal is to weekly update a PowerPivot data model with a database dump.  

    The 'problem'  is:

    1) A dump  can contain duplicate rows which were already there in the previous week dump

    2) A dump can contain partially identical rows compared to the previous week dump,  only differing in a few order-process status column. In these rows basically a few status columns needs to be updated.

    How can I tackle this?

    Thanks in advance!

    ps i dont have Power Query, so I need a PowerPivot workaround :)

    Tuesday, August 29, 2017 7:09 PM

Answers

  • You've already mentioned that you need a pure PowerPivot workaround, but ideally you should do this sort of data cleansing before loading the data into PowerPivot. So it would be worth checking if you can find a better source for the data or a different way of extracting it.

    But that said, what we need is some way of identifying which of the "duplicates" to keep do you have anything that can be used to identify which version of an order to keep? If you have something like a LastUpdated date and you only wanted to keep the most recent record you could do something like the following:

    =rankx(filter(all(Orders),Orders[OrderNumber] = EARLIER(Orders[OrderNumber])),Orders[LastUpdated])

    This gets all the orders with the same number and ranks them by the LastUpdated column. Then if you called this calculated column "IsLatest" in every measure you would have to filter for IsLatest=1

    eg.

    Order Amount := CALCULATE( SUM( Orders[Amount] ) , Orders[IsLatest] = 1)

    But you would need a sequence number or a last updated date or something like in your data to identify which record to keep. If you have nothing like that and some of your rows are exact duplicates the only other option I can think of is to create a random number column first using =RAND() then rank on that...


    http://darren.gosbell.com - please mark correct answers

    Tuesday, August 29, 2017 10:33 PM

All replies

  • You've already mentioned that you need a pure PowerPivot workaround, but ideally you should do this sort of data cleansing before loading the data into PowerPivot. So it would be worth checking if you can find a better source for the data or a different way of extracting it.

    But that said, what we need is some way of identifying which of the "duplicates" to keep do you have anything that can be used to identify which version of an order to keep? If you have something like a LastUpdated date and you only wanted to keep the most recent record you could do something like the following:

    =rankx(filter(all(Orders),Orders[OrderNumber] = EARLIER(Orders[OrderNumber])),Orders[LastUpdated])

    This gets all the orders with the same number and ranks them by the LastUpdated column. Then if you called this calculated column "IsLatest" in every measure you would have to filter for IsLatest=1

    eg.

    Order Amount := CALCULATE( SUM( Orders[Amount] ) , Orders[IsLatest] = 1)

    But you would need a sequence number or a last updated date or something like in your data to identify which record to keep. If you have nothing like that and some of your rows are exact duplicates the only other option I can think of is to create a random number column first using =RAND() then rank on that...


    http://darren.gosbell.com - please mark correct answers

    Tuesday, August 29, 2017 10:33 PM
  • Hi RJR1000,

    Thanks for your question.

    I would not suggest you to do this in PowerPivot, you should do this kind of thing throuth ETL process berfore loading them into PowerPivot. 

    To do this through SSIS, please refer to below blog:
    http://sqlblog.net/2014/05/01/insert-and-update-records-with-a-ssis-etl-package/

    Best Regards
    Willson Yuan
    MSDN Community Support
    Please remember to click "Mark as Answer" the responses that resolved your issue, and to click "Unmark as Answer" if not. This can be beneficial to other community members reading this thread. If you have any compliments or complaints to MSDN Support, feel free to contact MSDNFSF@microsoft.com

    Wednesday, August 30, 2017 6:14 AM