Answered by:
GetHashCode

Hi all,
I need to implement a GetHashCode for my vector3 class but the x, y, z components are float so this technique doesn’t work:
return x ^ y ^ z;
I can use x.GetHashCode()…, but I have read on the msdn that it is not advised to base your GetHashCode to MS’.
So what should I do?
Lastly on msdn it says:
“However, if two objects do not compare as equal, the GetHashCode methods for the two object do not have to return different values.”
Does this mean, it’s ok if they return the same values?
Thanks a lot,
Aw
Question
Answers

Unless the invariant (value) of your class class is represented within 32bits, there's no possible way that GetHashCode can return unique values for every possible instance of your class, despite what MSDN says about what GetHashCode returns. It's a hash code, used to group objects by value to aid in sorting, etc. Be sure to use an algorithm that calculates a reasonably unique value and you're done.

Azurewrath wrote: I can use x.GetHashCode()…, but I have read on the msdn that it is not advised to base your GetHashCode to MS’.
Where did you read that it's not advised to base you hash function on MS? I think you might have misunderstood something. In any case, the following should be a perfectly reasonable implementation (in fact, one of their own examples on the doc page for GetHashCode does almost exactly this):
x.GetHashCode() ^ y.GetHashCode() ^ z.GetHashCode()

The invariant or the "value" of a class is the data members. A class may have data members but may not have any "value". e.g. a class that simply references other objects likely doesn't have "value"in which case GetHashCode is meaningless, other than to signify a reference object (i.e. call Object.GetHashCode).
In you case you have 3 floats, each of which are 32bits. You're "value" is 96bits. GetHashCode returns a 32bit value. Calculating a 32bit hash from a 96bits of data will result in duplicate hash codes. If you had only 32bits of data you could simply return those 32bits an get a unique hash. (returning a float as a int/uint requires use of BitConverter).

To explain a bit further... With a 32bit hash there are 2 to 32^{nd} power different hash codes possible. If there are 33 bits in your "value" then there are twice as many possible object values as there are possible hash codes. Hence, on the average, each hash code will represent two different values of your object. Different object values that result in the same hash code are hash collisions.
With 96 bits in your object there are 96 minus 32, or 64 bits difference between the hash and the object values. Hence, on the average, each hash code will have to represent 2 to the 64^{th} power different object values.
One of the difficulties in choosing hash functions is the even distribution of the collisions.It's a rare situation when this is a simple problem. In other words, "Warning! Here be dragons!"

Azurewrath wrote:
Nimrand, I remember reading it, but I guess I was wrong I use the above implementation. So is this what I could always use if I can, as in using base type's GetHashCode method?If by that you mean the using the GetHashCode method of the fields that make up the class's value, then, yes, you could always do it this way (the base type's hash code function would actually refer to base.GetHashCode(), which is usually not a good idea to use). There are circumstances where it may not be the best choice of hash code functions, though. But, writing a good hash function can be tricky. So, this or similar methods can be a good default, and you can always work on making a better one if you find that one is getting a lot of collisions in your hashtables.

Azurewrath wrote: Thanks guys. Very helpful.
So even this has the same duplicate problem:
x.GetHashCode() ^ y.GetHashCode() ^ z.GetHashCode()
?
I mean, this is still 96 bits, right? So it has the same potential collision chance?
Thanks,
Aw 
Peter Ritchie wrote: Azurewrath wrote: Yes, for example, the hash code for x=10.0, y=10.0, z=0.0 is the same as x=10.0, y=0.0, z = 10.0
Which for a vector class may actually be a fairly serious flaw, since it is not uncommon to deal with vector positions that are the same except being oriented along different axis.

As long as it always returns the same hash code for two equal values, then it will never lead to incorrect behavior. Collisions only make hashtables slower. If your GetHashCode function always returns 0, then your hashtable algorithm effectively becomes a search through a linked list, which defeats the purpose of using a hashtable.
I wouldn't worry about it too much unless you're actually going to use your class in a hashtable (such as Dictionary<>). Otherwise, just use something simple: such as either of my suggestions, and be done with it.
All replies

Unless the invariant (value) of your class class is represented within 32bits, there's no possible way that GetHashCode can return unique values for every possible instance of your class, despite what MSDN says about what GetHashCode returns. It's a hash code, used to group objects by value to aid in sorting, etc. Be sure to use an algorithm that calculates a reasonably unique value and you're done.

Azurewrath wrote: I can use x.GetHashCode()…, but I have read on the msdn that it is not advised to base your GetHashCode to MS’.
Where did you read that it's not advised to base you hash function on MS? I think you might have misunderstood something. In any case, the following should be a perfectly reasonable implementation (in fact, one of their own examples on the doc page for GetHashCode does almost exactly this):
x.GetHashCode() ^ y.GetHashCode() ^ z.GetHashCode()

Thanks guys.
Nimrand, I remember reading it, but I guess I was wrong I use the above implementation. So is this what I could always use if I can, as in using base type's GetHashCode method?
Also Peter, I am not sure what you mean by the invariant/value of the class? Do you mean if I use a value that is less than 32bit or more than 32bit, there is no way to guarantee unique values?
Thanks alot again,
Aw 
The invariant or the "value" of a class is the data members. A class may have data members but may not have any "value". e.g. a class that simply references other objects likely doesn't have "value"in which case GetHashCode is meaningless, other than to signify a reference object (i.e. call Object.GetHashCode).
In you case you have 3 floats, each of which are 32bits. You're "value" is 96bits. GetHashCode returns a 32bit value. Calculating a 32bit hash from a 96bits of data will result in duplicate hash codes. If you had only 32bits of data you could simply return those 32bits an get a unique hash. (returning a float as a int/uint requires use of BitConverter).

To explain a bit further... With a 32bit hash there are 2 to 32^{nd} power different hash codes possible. If there are 33 bits in your "value" then there are twice as many possible object values as there are possible hash codes. Hence, on the average, each hash code will represent two different values of your object. Different object values that result in the same hash code are hash collisions.
With 96 bits in your object there are 96 minus 32, or 64 bits difference between the hash and the object values. Hence, on the average, each hash code will have to represent 2 to the 64^{th} power different object values.
One of the difficulties in choosing hash functions is the even distribution of the collisions.It's a rare situation when this is a simple problem. In other words, "Warning! Here be dragons!"

Azurewrath wrote:
Nimrand, I remember reading it, but I guess I was wrong I use the above implementation. So is this what I could always use if I can, as in using base type's GetHashCode method?If by that you mean the using the GetHashCode method of the fields that make up the class's value, then, yes, you could always do it this way (the base type's hash code function would actually refer to base.GetHashCode(), which is usually not a good idea to use). There are circumstances where it may not be the best choice of hash code functions, though. But, writing a good hash function can be tricky. So, this or similar methods can be a good default, and you can always work on making a better one if you find that one is getting a lot of collisions in your hashtables.


Azurewrath wrote: Thanks guys. Very helpful.
So even this has the same duplicate problem:
x.GetHashCode() ^ y.GetHashCode() ^ z.GetHashCode()
?
I mean, this is still 96 bits, right? So it has the same potential collision chance?
Thanks,
Aw 

Peter Ritchie wrote: Azurewrath wrote: Yes, for example, the hash code for x=10.0, y=10.0, z=0.0 is the same as x=10.0, y=0.0, z = 10.0
Which for a vector class may actually be a fairly serious flaw, since it is not uncommon to deal with vector positions that are the same except being oriented along different axis.


This might be a better implementation in this case (though it depends upon how Int64.GetHashCode() is implemented):
((((long)x) << 4)  ((long)y)).GetHashCode() ^ z.GetHashCode();Edit: Scratch that. I just did some testing, and found out that all Int32.GetHashCode() does is return its own value. That really surprises me, actually.


One of my colleagues showed me that even your GetHashCode method always return 0, it's still gonna work, and it did. So is it only a speed issue, and no matter how many collisions it might have, it will still gonna find the right value (in the dictionary, etc)?
Thanks,
Aw 
As long as it always returns the same hash code for two equal values, then it will never lead to incorrect behavior. Collisions only make hashtables slower. If your GetHashCode function always returns 0, then your hashtable algorithm effectively becomes a search through a linked list, which defeats the purpose of using a hashtable.
I wouldn't worry about it too much unless you're actually going to use your class in a hashtable (such as Dictionary<>). Otherwise, just use something simple: such as either of my suggestions, and be done with it.
