locked
LINQ GroupBy not working RRS feed

  • Question

  • Hi,

    I've got a problem implementing a GroupBy using an EqualityComparer class (it doesn't make a difference if I use IEqualityComparer either). The GroupBy clause returns NO data.  If I check results.Count(), it equals 0.

    From the debug output, the GetHashCode() appears to get called for each element and it appears to return an unique int, but the Equals() method only gets called about 12 times when I have about 40,000 elements in the collection.

    If I don't use LINQ and write code to find duplicates, the MatchData class works as intended and I find the duplicate records as required.

    Any help would be greatly appreciated.

    Dale.

        private void CheckForDuplicateAddresses(Customer[] customers)
        {
          //build matching data
          List<MatchData> list = new List<MatchData>();
          foreach (Customer customer in customers)
          {
            list.Add(new MatchData(customer));
          }
    
          var results = list.GroupBy(c => c, new MatchData()).Where(g => g.Count() > 1).Select(x => new { Key = x.Key.Address, Customers = x.ToList() });
    
          foreach (var result in results)
          {
            row++;
            worksheet.Cells[row, 1].Value = "Address";
            foreach (MatchData md in result.Customers)
            {
              row++;
              worksheet.Cells[row, 2].Value = md.Customer.Code;
              worksheet.Cells[row, 3].Value = md.Customer.Name;
              worksheet.Cells[row, 4].Value = md.Customer.Address;
            }
          }
    
        }
    
        public class MatchData : EqualityComparer<MatchData>, IEquatable<MatchData>
        {
          public Customer Customer { get; private set; }
          public string[] Name { get; private set; }
          public string[] Address { get; private set; }
    
          public MatchData() : base() { }
    
          public MatchData(Customer customer)
          {
            Customer = customer;
            Name = Split(customer.Name);
            Address = Split(customer.Address);
          }
    
          #region Equals
    
          public override bool Equals(object obj)
          {
            if (obj is MatchData)
              return Equals((MatchData)obj);
            else
              return false;
          }
    
          public bool Equals(MatchData other)
          {
            if (this.Name != null && other.Name != null)
            {
              if (this.Name.Intersect(other.Name).Count() < 1)
                return false;
            }
            //allow both to be null, but NOT one or the other
            else if (!(this.Name == null && other.Name == null))
              return false;
    
            if (this.Address != null && other.Address != null)
            {
              if (this.Address.Intersect(other.Address).Count() < 3)
                return false;
            }
            else
              return false;
    
            return true;
          }
    
          public static bool operator ==(MatchData x, MatchData y)
          {
            return x.Equals(y);
          }
    
          public static bool operator !=(MatchData x, MatchData y)
          {
            return !x.Equals(y);
          }
    
          #endregion equals
    
          private string[] Split(string text)
          {
            if (string.IsNullOrEmpty(text))
              return null;
    
            List<string> words = new List<string>();
            int len = text.Length;
            int start = 0;
            for (int i = 0; i <= len; i++)
            {
              if (i == len || !char.IsLetterOrDigit(text[i]))
              {
                if (i == start)
                  start++;
                else
                {
                  //if length of word is less than 1 character, then drop it
                  if (i - start > 1)
                  {
                    words.Add(text.Substring(start, i - start));
                  }
                  start = i + 1;
                }
              }
            }
    
            return words.ToArray();
          }
    
    
          public override bool Equals(MatchData x, MatchData y)
          {
            System.Diagnostics.Debug.WriteLine("Equals: " + x.Customer.Name + " == " + y.Customer.Name);
            return x.Equals(y);
          }
    
          public override int GetHashCode(MatchData obj)
          {
            //System.Diagnostics.Debug.WriteLine("GetHashCode: " + obj.Customer.Name + " : " + obj.GetHashCode());
            return obj.GetHashCode();
          }
        }
    
    


    Friday, December 4, 2015 4:11 AM

Answers

  • Yes ,you are right.

    it is my fault. i had missed the Split.

    so i create a simply project to test group by action.
    first i create a new data class like below, it is simply class,but more clear

        public class MyClasTest : IEquatable<MyClasTest>,IEqualityComparer<MyClasTest>
        {
            public string Name = string.Empty;
    
            public bool Equals(MyClasTest other)
            {
                Console.WriteLine("IEquatable invoke");
                return this.Name.Equals(other.Name);
            }
    
            public override bool Equals(object obj)
            {
                return this.Equals(obj as MyClasTest);
            }
    
            bool IEqualityComparer<MyClasTest>.Equals(MyClasTest x, MyClasTest y)
            {
                Console.WriteLine("IEqualityComparer invoke");
                return x.Name.Equals(y.Name);
            }
    
            int IEqualityComparer<MyClasTest>.GetHashCode(MyClasTest obj)
            {
                Console.WriteLine("IEqualityComparer gethaskcode invoke");
                return obj.GetHashCode();
            }
        }

    next ,create a list includes 9 items to execute groupby method like below 

            List<MyClasTest> _ltest = new List<MyClasTest>();
            _ltest.Add(new MyClasTest() { Name = "1111" });
            _ltest.Add(new MyClasTest() { Name = "1111" });
            _ltest.Add(new MyClasTest() { Name = "112" });
            _ltest.Add(new MyClasTest() { Name = "1122" });
            _ltest.Add(new MyClasTest() { Name = "11444" });
            _ltest.Add(new MyClasTest() { Name = "11444" });
            _ltest.Add(new MyClasTest() { Name = "1111" });
            _ltest.Add(new MyClasTest() { Name = "112" });
            _ltest.Add(new MyClasTest() { Name = "1111" });
    
    
            var _rsult = _ltest.GroupBy(_e => _e, new MyClasTest());
    
            Console.WriteLine("-------"+_rsult.Count().ToString());

    and  the groupby method result output is below 

    IEqualityComparer gethaskcode invoke
    IEqualityComparer gethaskcode invoke
    IEqualityComparer gethaskcode invoke
    IEqualityComparer gethaskcode invoke
    IEqualityComparer gethaskcode invoke
    IEqualityComparer gethaskcode invoke
    IEqualityComparer gethaskcode invoke
    IEqualityComparer gethaskcode invoke
    IEqualityComparer gethaskcode invoke
    -------9  // count is 9 . represent , every group include only one item. group.count = 1

    obviously, it's wrong.

    why? let's take look at the "return obj.GetHashCode".... 
    and i override the MyClasTest.GetHashCode() method,

    public class MyClasTest : IEquatable<MyClasTest>,IEqualityComparer<MyClasTest>
        {
            public string Name = string.Empty;
    
            public bool Equals(MyClasTest other)
            {
                Console.WriteLine("IEquatable invoke");
                return this.Name.Equals(other.Name);
            }
    
            public override bool Equals(object obj)
            {
                return this.Equals(obj as MyClasTest);
            }
    
            public override int GetHashCode()
            {
                Console.WriteLine("get base hash code");
                return base.GetHashCode();
    
            }
    
            bool IEqualityComparer<MyClasTest>.Equals(MyClasTest x, MyClasTest y)
            {
                Console.WriteLine("IEqualityComparer invoke");
                return x.Name.Equals(y.Name);
            }
    
            int IEqualityComparer<MyClasTest>.GetHashCode(MyClasTest obj)
            {
                Console.WriteLine("IEqualityComparer gethaskcode invoke");
                return obj.GetHashCode();
            }
        }

     and run test again. the out put is below

    IEqualityComparer gethaskcode invoke
    get base hash code
    IEqualityComparer gethaskcode invoke
    get base hash code
    IEqualityComparer gethaskcode invoke
    get base hash code
    IEqualityComparer gethaskcode invoke
    get base hash code
    IEqualityComparer gethaskcode invoke
    get base hash code
    IEqualityComparer gethaskcode invoke
    get base hash code
    IEqualityComparer gethaskcode invoke
    get base hash code
    IEqualityComparer gethaskcode invoke
    get base hash code
    IEqualityComparer gethaskcode invoke
    get base hash code
    -------9

    obvioursly, the MyClasTest.GetHashCode override method cause the problem.
    because the default GetHashCode return the unique value for every MyClasTest object (as same as the your MatchData)
    so. groupby ignore the following Equals operation

    if i override the GetHashCode like below 

            public override int GetHashCode()
            {
                Console.WriteLine("get base hash code");
                //return base.GetHashCode();
                return this.Name.GetHashCode();
    
            }

    IEqualityComparer gethaskcode invoke
    get base hash code
    IEqualityComparer gethaskcode invoke
    get base hash code
    IEqualityComparer invoke
    IEqualityComparer gethaskcode invoke
    get base hash code
    IEqualityComparer gethaskcode invoke
    get base hash code
    IEqualityComparer gethaskcode invoke
    get base hash code
    IEqualityComparer gethaskcode invoke
    get base hash code
    IEqualityComparer invoke
    IEqualityComparer gethaskcode invoke
    get base hash code
    IEqualityComparer invoke
    IEqualityComparer gethaskcode invoke
    get base hash code
    IEqualityComparer invoke
    IEqualityComparer gethaskcode invoke
    get base hash code
    IEqualityComparer invoke
    -------4

    it would work fine and return correct result.

    the conclusion is .you must override MatchData.GetHashCode, because of the Equals method is override. below warning reference from

     https://msdn.microsoft.com/en-us/library/system.object.gethashcode%28v=vs.110%29.aspx

    "If you override the GetHashCode method, you should also override Equals, and vice versa. If your overridden Equals method returns true when two objects are tested for equality, your overridden GetHashCodemethod must return the same value for the two objects. "

    hope the test helps


    DON'T TRY SO HARD,THE BEST THINGS COME WHEN YOU LEAST EXPECT THEM TO.

    • Edited by Matthew LEAN . D Saturday, December 5, 2015 9:51 AM
    • Proposed as answer by Albert_Zhang Friday, December 11, 2015 12:23 PM
    • Marked as answer by Kristin Xie Tuesday, December 15, 2015 2:08 AM
    Saturday, December 5, 2015 2:27 AM

All replies

  • I think the Equals override method cause the problem

    take a look below code snippets

          public bool Equals(MatchData other)
          {
            if (this.Name != null && other.Name != null)
            {
              if (this.Name.Intersect(other.Name).Count() < 1)
                return false;
            }
            //allow both to be null, but NOT one or the other
            else if (!(this.Name == null && other.Name == null))
              return false;
    
            if (this.Address != null && other.Address != null)
            {
              if (this.Address.Intersect(other.Address).Count() < 3)
                return false;
            }
            else
              return false;
    
            return true; /// the flow might not go to here forever...
          }

    the method would return false ever..

    so. the Group.Count() alway equal 1....

    .Where(g => g.Count() > 1) /// would not be matched ever..

    and by the way .the thought the Intersect method might not be good solution. it match any order and discrete (e.g, "x1x2,x3,,5" Intersect "1,4,2,d,5" would match "1,2,5" and count() = 3....

    i don't think that is your expect for.  if that so . my name Matthew = Mohammed? because Intersect method would return match "M,a,h,e" which count > 3 , it is funny!


    DON'T TRY SO HARD,THE BEST THINGS COME WHEN YOU LEAST EXPECT THEM TO.

    Friday, December 4, 2015 6:37 AM
  • You can get true quite easily.  When the object is initially created from the Customer object, it splits the name and address into individuals words.  So Name could equal { "John", "Smith" } and Address could be { "1", "White", "Road", "Somewhere" }.  The Equals() method only wants 1 word from the Name and 3 words from the Address to match.  My actual code does more fuzzy matching, so I can even get a few more matches, and it does work.

    I am currently using the same class (hence a few extra methods) to do the matching as if LINQ doesn't exist.  The matching code does work because I've found about 8% of our customers are duplicates.

    Dale.

    Friday, December 4, 2015 1:08 PM
  • Hi,

    I've reused the MatchData class to implement IEqualityComparer<>, and from what I've seen suggested on the internet, it also implements IEqualable<>.  This way I don't have to maintain duplicate code.

    The GroupBy clause can use an IEqualityComparer<> object for custom comparisons.  As I've mentioned above, the Equals() method works, just the Equals(<T> x, <T> y) method never appears to get called when using the LINQ query.

    Dale.


    • Edited by Dale Harris Wednesday, December 9, 2015 4:35 AM accidentally deleted
    Friday, December 4, 2015 1:24 PM
  • Implement a HashCode, it's used first on equal operations cause of performace.


    • Edited by MDeero Friday, December 4, 2015 1:47 PM
    Friday, December 4, 2015 1:43 PM
  • I've already got a GetHashCode() method as it's required by the IEqualityComparer<> interface.  I have tried returning the unique Customer ID, and in the above code, Object.GetHashCode().  It makes no difference which way I do it.

    The problem is I don't want GroupBy clause relying upon GetHashCode() as the Equals() method is dynamic.  That is object A could equal B and C, but that does mean object B equals C because B and C could equal A for totally different reasons (that is Name and Address properties could match on totally different words for each object).

    Dale.

    Friday, December 4, 2015 1:59 PM
  • Well if you want GroupBy to not use what it is build to use or not... it will use it!

    And the GetHash thats presented is just obj.GetHashCode(), it is not sufficient.

    Please follow the link, it describes you issue (not with GroupBy, but with Distinct)

    BTW, you can write your own LINQ extensions, and the source code is available if you need insight

    • Edited by MDeero Friday, December 4, 2015 2:28 PM
    Friday, December 4, 2015 2:23 PM
  • I have already made GetHashCode() return the Customer ID, but it made no difference.

    From reading the following link, It says NOT to assume equal hash code means that the objects are equal.  BUT, when you read on a bit further, it says that you should return equal hash codes if the objects are equal.  What is right and what is wrong.

    https://msdn.microsoft.com/en-us/library/system.object.gethashcode%28v=vs.110%29.aspx

    You mention to follow a link in your reply, but the only link you have didn't seem to link to what you were talking about.

    Dale.

    Friday, December 4, 2015 10:37 PM
  • Yes ,you are right.

    it is my fault. i had missed the Split.

    so i create a simply project to test group by action.
    first i create a new data class like below, it is simply class,but more clear

        public class MyClasTest : IEquatable<MyClasTest>,IEqualityComparer<MyClasTest>
        {
            public string Name = string.Empty;
    
            public bool Equals(MyClasTest other)
            {
                Console.WriteLine("IEquatable invoke");
                return this.Name.Equals(other.Name);
            }
    
            public override bool Equals(object obj)
            {
                return this.Equals(obj as MyClasTest);
            }
    
            bool IEqualityComparer<MyClasTest>.Equals(MyClasTest x, MyClasTest y)
            {
                Console.WriteLine("IEqualityComparer invoke");
                return x.Name.Equals(y.Name);
            }
    
            int IEqualityComparer<MyClasTest>.GetHashCode(MyClasTest obj)
            {
                Console.WriteLine("IEqualityComparer gethaskcode invoke");
                return obj.GetHashCode();
            }
        }

    next ,create a list includes 9 items to execute groupby method like below 

            List<MyClasTest> _ltest = new List<MyClasTest>();
            _ltest.Add(new MyClasTest() { Name = "1111" });
            _ltest.Add(new MyClasTest() { Name = "1111" });
            _ltest.Add(new MyClasTest() { Name = "112" });
            _ltest.Add(new MyClasTest() { Name = "1122" });
            _ltest.Add(new MyClasTest() { Name = "11444" });
            _ltest.Add(new MyClasTest() { Name = "11444" });
            _ltest.Add(new MyClasTest() { Name = "1111" });
            _ltest.Add(new MyClasTest() { Name = "112" });
            _ltest.Add(new MyClasTest() { Name = "1111" });
    
    
            var _rsult = _ltest.GroupBy(_e => _e, new MyClasTest());
    
            Console.WriteLine("-------"+_rsult.Count().ToString());

    and  the groupby method result output is below 

    IEqualityComparer gethaskcode invoke
    IEqualityComparer gethaskcode invoke
    IEqualityComparer gethaskcode invoke
    IEqualityComparer gethaskcode invoke
    IEqualityComparer gethaskcode invoke
    IEqualityComparer gethaskcode invoke
    IEqualityComparer gethaskcode invoke
    IEqualityComparer gethaskcode invoke
    IEqualityComparer gethaskcode invoke
    -------9  // count is 9 . represent , every group include only one item. group.count = 1

    obviously, it's wrong.

    why? let's take look at the "return obj.GetHashCode".... 
    and i override the MyClasTest.GetHashCode() method,

    public class MyClasTest : IEquatable<MyClasTest>,IEqualityComparer<MyClasTest>
        {
            public string Name = string.Empty;
    
            public bool Equals(MyClasTest other)
            {
                Console.WriteLine("IEquatable invoke");
                return this.Name.Equals(other.Name);
            }
    
            public override bool Equals(object obj)
            {
                return this.Equals(obj as MyClasTest);
            }
    
            public override int GetHashCode()
            {
                Console.WriteLine("get base hash code");
                return base.GetHashCode();
    
            }
    
            bool IEqualityComparer<MyClasTest>.Equals(MyClasTest x, MyClasTest y)
            {
                Console.WriteLine("IEqualityComparer invoke");
                return x.Name.Equals(y.Name);
            }
    
            int IEqualityComparer<MyClasTest>.GetHashCode(MyClasTest obj)
            {
                Console.WriteLine("IEqualityComparer gethaskcode invoke");
                return obj.GetHashCode();
            }
        }

     and run test again. the out put is below

    IEqualityComparer gethaskcode invoke
    get base hash code
    IEqualityComparer gethaskcode invoke
    get base hash code
    IEqualityComparer gethaskcode invoke
    get base hash code
    IEqualityComparer gethaskcode invoke
    get base hash code
    IEqualityComparer gethaskcode invoke
    get base hash code
    IEqualityComparer gethaskcode invoke
    get base hash code
    IEqualityComparer gethaskcode invoke
    get base hash code
    IEqualityComparer gethaskcode invoke
    get base hash code
    IEqualityComparer gethaskcode invoke
    get base hash code
    -------9

    obvioursly, the MyClasTest.GetHashCode override method cause the problem.
    because the default GetHashCode return the unique value for every MyClasTest object (as same as the your MatchData)
    so. groupby ignore the following Equals operation

    if i override the GetHashCode like below 

            public override int GetHashCode()
            {
                Console.WriteLine("get base hash code");
                //return base.GetHashCode();
                return this.Name.GetHashCode();
    
            }

    IEqualityComparer gethaskcode invoke
    get base hash code
    IEqualityComparer gethaskcode invoke
    get base hash code
    IEqualityComparer invoke
    IEqualityComparer gethaskcode invoke
    get base hash code
    IEqualityComparer gethaskcode invoke
    get base hash code
    IEqualityComparer gethaskcode invoke
    get base hash code
    IEqualityComparer gethaskcode invoke
    get base hash code
    IEqualityComparer invoke
    IEqualityComparer gethaskcode invoke
    get base hash code
    IEqualityComparer invoke
    IEqualityComparer gethaskcode invoke
    get base hash code
    IEqualityComparer invoke
    IEqualityComparer gethaskcode invoke
    get base hash code
    IEqualityComparer invoke
    -------4

    it would work fine and return correct result.

    the conclusion is .you must override MatchData.GetHashCode, because of the Equals method is override. below warning reference from

     https://msdn.microsoft.com/en-us/library/system.object.gethashcode%28v=vs.110%29.aspx

    "If you override the GetHashCode method, you should also override Equals, and vice versa. If your overridden Equals method returns true when two objects are tested for equality, your overridden GetHashCodemethod must return the same value for the two objects. "

    hope the test helps


    DON'T TRY SO HARD,THE BEST THINGS COME WHEN YOU LEAST EXPECT THEM TO.

    • Edited by Matthew LEAN . D Saturday, December 5, 2015 9:51 AM
    • Proposed as answer by Albert_Zhang Friday, December 11, 2015 12:23 PM
    • Marked as answer by Kristin Xie Tuesday, December 15, 2015 2:08 AM
    Saturday, December 5, 2015 2:27 AM
  • If your CustomerID is indeed a unique value for your MatchData class, it should have the desired effect and there is something else going wrong too. (In which case i am lost)

    The link is not referring to reading material, its just for the .NET open sources, where you could look zp the method to get the idea how it works and the implement an own (extension) method to suit your needs (for example without the GetHashCode involved). I don't say this would be easy, but everything that happens inside the LINQ can be looked up. It's also great for understanding why something doesn't work although you think everything is fine... (as such it was just meant to point you to the "source" of the problem ;-)

    Saturday, December 5, 2015 11:15 AM
  • If your CustomerID is indeed a unique value for your MatchData class, it should have the desired effect and there is something else going wrong too. (In which case i am lost)

    The link is not referring to reading material, its just for the .NET open sources, where you could look zp the method to get the idea how it works and the implement an own (extension) method to suit your needs (for example without the GetHashCode involved). I don't say this would be easy, but everything that happens inside the LINQ can be looked up. It's also great for understanding why something doesn't work although you think everything is fine... (as such it was just meant to point you to the "source" of the problem ;-)

         so . the Equals method main part implemented  by Intersect method that cause that difficult to Generate the same HashCode between the MatchData objects which's Equals method return true.

     let's  say ,there has two Name match, A is "Hello World A" , B is "Hello Word B",  How to generate the Hash Code of A as same as hash Code from B?  their Hash code be access before their Equals method....

    So before Equals method invoked , it don't know the Intersect Part,  ---> not has Intersect part can not generate the same hash code.

    ---> not hash code can not go to Equals method....

    just like dead lock...

    maybe it would work if the MatchData's GetHashCode override method return a const int value (e.g 1),,,

    if every MatchData GetHashCode method return 1, i would through over GetHashCode and go to Equals method.

    but i don't think is works efficiently

            public override int GetHashCode()
            {
                Console.WriteLine("get base hash code");
                //return base.GetHashCode();
                return 1;
            }

    ......


    DON'T TRY SO HARD,THE BEST THINGS COME WHEN YOU LEAST EXPECT THEM TO.

    Saturday, December 5, 2015 11:41 AM
  • Each MatchData has string[] Name and string[] Address, which you get by splitting Customer.Name and Customer.Address. You want to treat the MatchData instances as equal if Name has at least one common string and Address has at least three common strings.

    Now suppose you have:

    1. list[0] = MatchData { Name = { "NNN", "OOO" }, Address = { "AAA", "BBB", "CCC", "DDD" } }
    2. list[1] = MatchData { Name = { "PPP", "OOO" }, Address = { "111", "BBB", "CCC", "DDD" } }
    3. list[2] = MatchData { Name = { "PPP", "QQQ" }, Address = { "111", "222", "CCC", "DDD" } }
    4. list[3] = MatchData { Name = { "RRR", "QQQ" }, Address = { "111", "222", "333", "DDD" } }

    With this input, you want to treat the MatchData instances as follows:

    list[0] list[1] list[2] list[3]
    list[0] equal equal different different
    list[1] equal equal equal different
    list[2] different equal equal equal
    list[3] different different equal equal

    This is not a transitive relation and cannot be reliably implemented as IEquatable<T> or IEqualityComparer<T>. It isn't even clear how many groups you'd want GroupBy to return in this case.

    Saturday, December 5, 2015 5:44 PM
  • Hi,

    Thankyou for the replies.  I've got it working by not using LINQ, and I think I'll just keep it that way as I'm still trying to understand LINQ fully.

    It also is a bit confusing when reading Microsoft documentation.  When reading the following link, the link Matthew gave above, the first bullet point under Remarks, says "You should NOT assume that equal hash codes imply object equality".  BUT, as Matthew pointed out when you read on a bit further, it says that you should return equal hash codes if the objects are equal.  What is right and what is wrong.

    https://msdn.microsoft.com/en-us/library/system.object.gethashcode%28v=vs.110%29.aspx

    My knowledge of LINQ isn't great, and when the creator can't write a clear and concise document, it just makes it harder.  I understand LINQ enough to extract other data correctly, but in this case, I think I'll give it a miss.

    Dale.
    • Edited by Dale Harris Wednesday, December 9, 2015 4:18 AM
    Wednesday, December 9, 2015 4:15 AM
  • Hi,

    I'm very sorry, but I have discovered that I accidently deleted some forum posts when I was trying to remove a post I made mistakenly.  Does anyone know how to undelete messages so I can make them available?  I have tried searching on how it maybe possible, but to no avail.

    Dale.

    Wednesday, December 9, 2015 4:26 AM
  • Returning Same Hash Code is the necessary condition for equality of two objects, 

    but not sufficient and necessary condition

    So. if hash codes are difference, the objects must be not equality

    If hash codes are same. the objects go to next compare phase using Equals method.


    DON'T TRY SO HARD,THE BEST THINGS COME WHEN YOU LEAST EXPECT THEM TO.

    Wednesday, December 9, 2015 4:42 AM