none
Fastest way of removing "duplicates" from a List<MyObject> RRS feed

  • Question

  • I've broken my code out into this small sample for simplicity. I have a List<MyObject> and I would like to remove these so called duplicates from it in the fastest way possible b/c I have an enormous data set. Anyhow, here is an example showing MyObject and a populated List<MyObject>.

    public partial class MainWindow : Window
      {
        public MainWindow()
        {
          InitializeComponent();
    
          List<MyObject> myObjectList = new List<MyObject>();
    
          myObjectList.Add(new MyObject() { Point = new Point(-1, 0), Value = 56 });
          myObjectList.Add(new MyObject() { Point = new Point(-1, 1), Value = 45.8 });
          myObjectList.Add(new MyObject() { Point = new Point(-1, 2), Value = 49.6 });
          myObjectList.Add(new MyObject() { Point = new Point(0, 0), Value = 39.6 });
          myObjectList.Add(new MyObject() { Point = new Point(0, 1), Value = 45.98 });
          myObjectList.Add(new MyObject() { Point = new Point(0, 2), Value = 34.45 });
          myObjectList.Add(new MyObject() { Point = new Point(1, 0), Value = 39.23 });
          myObjectList.Add(new MyObject() { Point = new Point(1, 1), Value = 41.11 });
          myObjectList.Add(new MyObject() { Point = new Point(1, 2), Value = 40.2 });
          myObjectList.Add(new MyObject() { Point = new Point(2, 0), Value = 29.89 });
          myObjectList.Add(new MyObject() { Point = new Point(2, 1), Value = 30.67 });
          myObjectList.Add(new MyObject() { Point = new Point(2, 2), Value = 40.1 });
          myObjectList.Add(new MyObject() { Point = new Point(0, 1), Value = 60.18 });
          myObjectList.Add(new MyObject() { Point = new Point(1, 2), Value = 27.2 });
        }
    
        public class MyObject
        {
          public Point Point { get; set; }
          public double Value { get; set; }
        }
      }
    

    Now, I would like to remove what I call the kind-of-duplicates , meaning I want to remove any MyObject from the list whose Point value is duplicated. Thus, from the list you'll see the last two MyObjects have duplicate point values. I want to remove the MyObject whose Value is the smallest. Going by that philosophy, the MyObject with a Point(0,1) and Value of 45.98 should be removed. Also, the MyObject with a Point(1,2) and Value of 27.2 should be removed.

    I would like to do this as fast as possible since my real data set is huge. I imagine Linq will lead the way? Any suggestions are much appreciated. Thanks.

    Wednesday, December 22, 2010 7:10 PM

Answers

  • You can sort them first, then keep only the highest Value per Point:

      myObjectList = (
        from o in myObjectList
        orderby o.Point.X, o.Point.Y, o.Value descending
        group o by o.Point into g
        select g.First()
      ).ToList();
    

    Wednesday, December 22, 2010 10:21 PM
  •   myObjectList.Sort((o1, o2) =>
      {
        if (o1.Point.X > o2.Point.X)
          return 1;
        else if (o1.Point.X < o2.Point.X)
          return -1;
        else
          return o1.Point.Y - o2.Point.Y;
      });
    
    Wednesday, December 22, 2010 9:44 PM
  •   Dictionary<Point, MyObject> already = new Dictionary<Point, MyObject>();
      foreach (var item in myObjectList)
      {
        if (already.ContainsKey(item.Point) && already[item.Point].Value >= item.Value)
          continue;
        already[item.Point] = item;
      }
      myObjectList = new List<MyObject>(already.Values);
    
    • Marked as answer by BBauer42 Wednesday, December 22, 2010 8:55 PM
    Wednesday, December 22, 2010 8:52 PM

All replies

  • If your real data is huge it may be impractical to sort in memory. you can use shell sort or sort using a database with the help of a multi-column index. If a database is used, you can use linq to issue the query.



    The following is signature, not part of post
    Please mark the post answered your question as the answer, and mark other helpful posts as helpful, so they will appear differently to other users who are visiting your thread for the same problem.
    Visual C++ MVP
    Wednesday, December 22, 2010 7:18 PM
  • I'm not trying to sort, I'm trying to remove duplicates. And yes, it has to be done in memory. I already have the large List<MyObject> in memory and need to remove the duplicates as detailed above. How can I do so in the most efficient manner for this specific scenario? Thanks.
    Wednesday, December 22, 2010 7:30 PM
  • This should do it:

     

     for (int i = 0; i < myObjectList.Count; i++)
       {
        Point _point = myObjectList[i].Point;
        for (int j = 0; j < myObjectList.Count; j++)
        {
         Point _point2 = myObjectList[j].Point;
         if (i != j)
         {
          if (_point == _point2)
          {
           int index = myObjectList.FindLastIndex(a => a.Point == _point2);
           myObjectList.RemoveAt(index);
          }
         }
        }
       }
    

     

     

    hope it helps,

    Mitja

    Wednesday, December 22, 2010 7:39 PM
  • Removing an item from the list is probably more expensive then rebuilding a new list
    from scratch - without the duplicates. See the remarks section of List<T>.RemoveAt .

    I'd recommend to walk the list, track the point-values in hashset
    and copy those who not yet actually exist into  the new list.
    Thus you have an O(n) performance, where as a solution that compares every item
    with every other item gets close to an O(n²) performance.


        static List<MyObject> RemoveDuplicates(IList<MyObject> originalList)
        {
          HashSet<Point> set = new HashSet<Point>();
          List<MyObject> cleanedList = new List<MyObject>(originalList.Count);
    
          foreach (MyObject obj in originalList)
          {
            if (!set.Contains(obj.Point))
            {
              set.Add(obj.Point);
              cleanedList.Add(obj);
            }
          }
          return cleanedList;
        }




    Chris
    Wednesday, December 22, 2010 8:02 PM
  •   Dictionary<Point, MyObject> already = new Dictionary<Point, MyObject>();
      foreach (var item in myObjectList)
      {
        if (already.ContainsKey(item.Point) && already[item.Point].Value >= item.Value)
          continue;
        already[item.Point] = item;
      }
      myObjectList = new List<MyObject>(already.Values);
    
    • Marked as answer by BBauer42 Wednesday, December 22, 2010 8:55 PM
    Wednesday, December 22, 2010 8:52 PM
  • Mitja, thanks, your method works but is not as fast as Louis.fr's method.

    Chris I tried yours as well but it did not remove any duplicates and I am happy with Louis's solution.

    Thanks everyone!!!

    Wednesday, December 22, 2010 8:56 PM
  • As a follow up, how can I sort the list based on the points?

    I would like them to be of this order after sorting:

    Point(0,0)

    Point(0,1)

    Point(0,2)

    Point(1,-1)

    Point(1, 0)

    Point (1,1)

    Point(2,0)

    Point(2,1)

    .......

    Wednesday, December 22, 2010 9:17 PM
  •   myObjectList.Sort((o1, o2) =>
      {
        if (o1.Point.X > o2.Point.X)
          return 1;
        else if (o1.Point.X < o2.Point.X)
          return -1;
        else
          return o1.Point.Y - o2.Point.Y;
      });
    
    Wednesday, December 22, 2010 9:44 PM
  • Mine returns a new list, the original remains untouched,
    So you have to assign it to some variable, in order to see that it works as expected
    - which you probably didn't ;-)

    So here is a benchmark that show you the performances of the three solutions,
    actually mine and Loius are equally fast, which doesn't suprise a lot since it's almost the same logic
    and the same algorithm.

    Chris

    using System;
    using System.Collections.Generic;
    using System.Diagnostics;
    using System.Drawing;
    
    namespace ListRemovalBenchmarkConsoleApp
    {
      class Program
      {
        private delegate void RemoveAction(ref List<MyObject> list);
    
        private static Random random = new Random();
    
        private const int ListSize = 100000;
        private const int DuplicateSize = 1000;
        
        static void Main(string[] args)
        {
    
          //create a list with random object
          List<MyObject> myObjectList1 = new List<MyObject>();
          Console.WriteLine("Start object generation");
          for (int i = 0; i < ListSize; i++)
          {
            myObjectList1.Add(RandomMyObject());
          }
    
          //Insert duplicates at random positions
          for (int i = 0; i < DuplicateSize; i++)
          {
            double d = random.NextDouble() * myObjectList1.Count;
            MyObject newObject = Clone(myObjectList1[(int)d]);
            int newIndex = (int)(random.NextDouble() * myObjectList1.Count);
            myObjectList1.Insert(newIndex, newObject);
          }
    
          //clone the lists, one for each benchmark
          var myObjectList2 = new List<MyObject>(myObjectList1);
          var myObjectList3 = new List<MyObject>(myObjectList1);
    
          //Run the benchamrk
          RunTest(myObjectList1, RemoveDuplicatesLF, "Louis");
          RunTest(myObjectList2, RemoveDuplicatesCB, "Chris");
          RunTest(myObjectList3, RemoveDuplicatesMB, "Mita");
          
          Console.ReadLine();
        }
    
        static MyObject Clone(MyObject original) {
          return new MyObject { Point = original.Point, Value = original.Value };
        }
        static MyObject RandomMyObject() {
          return new MyObject {
            Point = new Point(random.Next(), random.Next()),
            Value = random.NextDouble() * 1000 };
        }
        static void RunTest(List<MyObject> list, RemoveAction proc, string key) 
        {
          Stopwatch stopwatch = new Stopwatch();
    
          Console.WriteLine
            (
              "Start removement {0}: {1}",
              key, 
              DateTime.Now.TimeOfDay.ToString("")
            );
    
          var startCount = list.Count;
          stopwatch.Start();
          proc(ref list);
          stopwatch.Stop();
    
          Console.WriteLine
            (
              "Stop removement {0}: {1} msecs, {2} objects removed",
              key,
              stopwatch.ElapsedMilliseconds,
              startCount - list.Count
            );
          Console.WriteLine();
        }
    
        static void RemoveDuplicatesCB(ref List<MyObject> myObjectList)
        {
          HashSet<Point> set = new HashSet<Point>();
          List<MyObject> cleanedList = new List<MyObject>(myObjectList.Count);
    
          foreach (MyObject obj in myObjectList)
          {
            if (!set.Contains(obj.Point))
            {
              set.Add(obj.Point);
              cleanedList.Add(obj);
            }
          }
          myObjectList = cleanedList;
        }
    
        static void RemoveDuplicatesLF(ref List<MyObject> myObjectList)
        {
          Dictionary<Point, MyObject> already = new Dictionary<Point, MyObject>();
          foreach (var item in myObjectList)
          {
            if (already.ContainsKey(item.Point) && already[item.Point].Value >= item.Value)
              continue;
            already[item.Point] = item;
          }
    
          myObjectList = new List<MyObject>(already.Values);
    
    
        }
    
        static void RemoveDuplicatesMB(ref List<MyObject> originalList)
        {
    
          for (int i = 0; i < originalList.Count; i++)
          {
            Point _point = originalList[i].Point;
            for (int j = 0; j < originalList.Count; j++)
            {
              Point _point2 = originalList[j].Point;
              if (i != j)
              {
                if (_point == _point2)
                {
                  originalList.RemoveAt(i);
                }
              }
            }
          }
    
        }
    
      }
    
      public class MyObject
      {
        public Point Point { get; set; }
        public double Value { get; set; }
      }
    
    }
    

    Wednesday, December 22, 2010 10:00 PM
  • Using Linq:

        var sortedList = myObjectList
              .OrderBy(o => o.Point.X)
              .ThenBy(o => o.Point.Y)
              .ToList();
    
    


    Chris

    Wednesday, December 22, 2010 10:05 PM
  • You can sort them first, then keep only the highest Value per Point:

      myObjectList = (
        from o in myObjectList
        orderby o.Point.X, o.Point.Y, o.Value descending
        group o by o.Point into g
        select g.First()
      ).ToList();
    

    Wednesday, December 22, 2010 10:21 PM
  •  

    Hi  BBauer42,

     

    Have you clear about this question?

     

    If there's any concern, please feel free to let me know.

     

    Have a nice day!


    Mike [MSFT]
    MSDN Community Support | Feedback to us
    Get or Request Code Sample from Microsoft
    Please remember to mark the replies as answers if they help and unmark them if they provide no help.

    Monday, December 27, 2010 3:17 AM
    Moderator
  • Yes I am all set, I marked a post as the solution. Thanks.
    Monday, December 27, 2010 3:04 PM
  •  

    Hi  BBauer42,

     

    Thanks for your response! I'm glad that you have clear about this problem.

     

    If there's any concern, please feel free to let me know.

     

    Have a nice day!


    Mike [MSFT]
    MSDN Community Support | Feedback to us
    Get or Request Code Sample from Microsoft
    Please remember to mark the replies as answers if they help and unmark them if they provide no help.

    Tuesday, December 28, 2010 3:49 AM
    Moderator
  • Hi,  would this work? I'm not sure, just asking...

    myObjectList = (
        from o in myObjectList
        orderby o.Point.X, o.Point.Y, o.Value descending
        
        select o.First()
      ).ToList();

    Note: I just essentially removed the line

    group o by o.Point into g

    Is Grouping really required here?

    Thanks for reply in advance.


    Saturday, December 30, 2017 2:33 PM
  • Maybe,

    But this question is from 2010, Linq was then just 2 years old.

    That is why in my idea all those sites who make code for infinity a fake and die slowly after a while. 

    If you have a real question, then make a new question. Almost nobody looks to an thread 7 years old. 


    Success Cor

    Saturday, December 30, 2017 6:34 PM