none
Hash table buckets

    Question

  • Hello,

    I'm just trying to wrap my brain around the hash table implementation used by Dictionary<Tkey, TValue>. It's taken me a bit to figure out what I think a "bucket" is. Could some one confirm if this is correct or not.

    1. A "bucket" is really just an element of the int[] buckets array the Dictionary<Tkey, TValue> declares?? Its indexed by the hash code...

    2. A "bucket" can contain only 0 or 1 entries (though I'm not quite clear how: indexing into the entries array?), unless there is a collision between hashed keys.

    3. In the event of a collision, Dictionary<Tkey, TValue> uses a chaining collision resolution strategy, that is, the original value in the bucket becomes the head of a liked list to which the "colliding" value is added?? Is this the purpose of the 'next' field on the Entry struct definition?? I'm assuming it is.

    So a bucket can contain 0 or 1 element unless there is a collision of hashed keys in which case a bucket contains a liked list of 2 or more elements with the same hash value???

    Something like that.

    And what is the 'freelist' variable doing? Is that just pointing to an empty bucket in the bucket array??? A random empty element? The 'next' empty element? Or something completely different.

    I'm going to have to step through the code a bunch more to get a better feel for how the 'entries' array and the 'buckets' array work together, but I'm just trying to get some basic vocabulary nailed down.

    Oh, and to resolve a collision where two keys hash to the same value, I guess thats where a simple equality comparison then comes into play, since the hash code can't differentiate the values any longer??

    Thanks.
    Wednesday, February 03, 2010 9:49 PM

Answers

  • It's much simpler than that.

    Let's say you have a string[] with 1000 strings. Now if you want to know if the array contains a particular string:

    string[] strings = GetBigStringArray();

    if (strings.Contains("Some String Value"))
       ...

    Array.Contains has to walk the entire array and test every single string until it either finds a match and stops looking, or it has walked the entire array (failure = O(length)).

    The basic idea of a hashed collection (a collection of buckets) is to break that large array into much smaller lists so that the entire list doesn't have to be checked every time.

    It does this by determining which bucket an item belongs to before adding it, or which bucket a search for item would be in if it were in the collection before searching.

    One very simple (and not very good) way to hash a list of strings would be this:

    1) Create 26 buckets.
    2) Add strings to each bucket based on what letter of the alphabet the string begins with.

    If we add these strings:
    George
    Mary
    Michael
    John

    The 'G' bucket has 1 string, the 'M' bucket has 2 strings, and the 'J' bucket has 1 string (all remaining buckets have 0 strings).

    If we want to search for "Jerry", we look in the 'J' bucket and perform only one string comparison rather than the 4 comparison's that would be required if we had to search the entire collection.

    The "hash algorithm" for our simple collection is "first letter of the string", which won't work very well if we add a bunch of strings that all happen to start with the same letter. A perfect hash algorithm would result in every single bucket having the same number of items (plus or minus one). There's no such thing as a perfect algorithm (unless you know the domain of the entire list ahead of time). But there are some good ones that are commonly used. I would refer you to Google if you're interested in hashing algorithms.

    The other consideration is the number of buckets. The more buckets, the fewer items each bucket will contain, therefore the faster searches happen.
    • Marked as answer by Roahn Luo Tuesday, February 09, 2010 8:54 AM
    Thursday, February 04, 2010 9:48 PM
  • You just restated everything that I stated in my post.

    The differences are purely implementation details.

    1) The dictionary stores a specialized key/value structure that has two extra fields: HashCode and Next.
    2) The buckets are implemented with a linked list, the Next field is used for this implementation.
    3) The hash algorithm produces two values: A quick comparer and a bucket index.

    The HashCode value stored in the entry is used to make fast comparisons. If two items have different HashCodes, they can't be the same item, so this test is very fast. When comparing, FindEntry will perform a thorough comparison only if the search item and the bucket item have the same HashCode.

    The "freelist" is simply a block of pre-allocated structures which the Dictionary uses when adding new items. It allocates blocks of memory at a time, rather than one each time an item is added.

    All of these 'details' are just performance enhancements. Everything I said above is still true.
    • Marked as answer by Roahn Luo Tuesday, February 09, 2010 8:54 AM
    Friday, February 05, 2010 2:17 PM

All replies

  • It's much simpler than that.

    Let's say you have a string[] with 1000 strings. Now if you want to know if the array contains a particular string:

    string[] strings = GetBigStringArray();

    if (strings.Contains("Some String Value"))
       ...

    Array.Contains has to walk the entire array and test every single string until it either finds a match and stops looking, or it has walked the entire array (failure = O(length)).

    The basic idea of a hashed collection (a collection of buckets) is to break that large array into much smaller lists so that the entire list doesn't have to be checked every time.

    It does this by determining which bucket an item belongs to before adding it, or which bucket a search for item would be in if it were in the collection before searching.

    One very simple (and not very good) way to hash a list of strings would be this:

    1) Create 26 buckets.
    2) Add strings to each bucket based on what letter of the alphabet the string begins with.

    If we add these strings:
    George
    Mary
    Michael
    John

    The 'G' bucket has 1 string, the 'M' bucket has 2 strings, and the 'J' bucket has 1 string (all remaining buckets have 0 strings).

    If we want to search for "Jerry", we look in the 'J' bucket and perform only one string comparison rather than the 4 comparison's that would be required if we had to search the entire collection.

    The "hash algorithm" for our simple collection is "first letter of the string", which won't work very well if we add a bunch of strings that all happen to start with the same letter. A perfect hash algorithm would result in every single bucket having the same number of items (plus or minus one). There's no such thing as a perfect algorithm (unless you know the domain of the entire list ahead of time). But there are some good ones that are commonly used. I would refer you to Google if you're interested in hashing algorithms.

    The other consideration is the number of buckets. The more buckets, the fewer items each bucket will contain, therefore the faster searches happen.
    • Marked as answer by Roahn Luo Tuesday, February 09, 2010 8:54 AM
    Thursday, February 04, 2010 9:48 PM
  • You asked about Dictionary which is slightly more complicated than a simple hash collection. Instead of adding a simple type like a string or integer, Dictionary stores a KeyValuePair. A KeyValuePair is.. well.. a key and a value.

    The Key is used to determain what bucket the item is in and is used for finding the item.
    The Value is just something you associated with the key. It can be anything.
    Thursday, February 04, 2010 9:53 PM
  • Indeed. It is much simpler than that, if we arbitrarily select a much simpler example. :)  But I’m focused on the hash table implementation in Dictionary<TKey, TValue> specifically; not hash tables in general.

     

    First, Dictionary<TKey, TValue> does not use the KeyValuePair<TKey, TValue> struct to store its internal data. Many of its methods accept a KeyValuePair<TKey, TValue> struct as a parameter, or return a KeyValuePair<TKey, TValue> struct. Instead, Dictionary<TKey, TValue> wraps an internal array of a more specialized struct, named simply Entry.

      

           private struct Entry
            {
                public int hashCode;
                public int next;
                public TKey key;
                public TValue value;
            }

     

    Looks similar to a KeyValuePair<TKey, TValue>, but adds two additional fields, ‘next’ and ‘hashcode’ which are critical pieces of its hash table implementation.

    So here’s how buckets work. Most of my previous post was wrong. Each element in the internally allocated ‘buckets’ array is a separate bucket. Each bucket’s value is either a positive integer (the bucket “contains” 1 or more dictionary entries) or -1 (the bucket is empty).  “Contains” in this context means “can be used to lookup”, or "maps to", an entry in the internally allocated ‘entries’ array (an array of Entry structs). So though a bucket’s value is always a single integer, that value can be used to lookup more than one dictionary entry, i.e., the dictionary entries “stored” in that bucket.

    The buckets array can be used like this because of the way Dictionary<TKey, TValue> stores hash codes. Both the entries array and the buckets array store values based on the entries hash code, but in two different forms.

    The Entry.hashcode field stores the value of calling GetHashCode() on the entry. Actually, it stores that value after a bitwise & operation is performed using 0x7fffffff (2,147,483,647 - upper range of signed 32-bit integers) as the second operand. This operation doesn’t modify the value returned by GetHashCode() if that value is positive; but it generates a new, positive integer if that value is negative. This simply ensures that Entry.hashcode always stores a positive integer.

    int num = this.comparer.GetHashCode(key) & 0x7fffffff;

     

    The value stored in the bucket element is based on the value stored in Entry.hashcode, but after performing a modulus operation, using buckets.Length as the second operand.

    int index = num % this.buckets.Length;

     

    This is why the bucket value can be used to lookup more than one element in the entries array. The modulus operation performed on different hashes can yield the same value. That value then becomes an index into the buckets array.

    Here’s the blow-by-blow:

    Lets say we’ve added 12 <string, int> entries to a Dictionary<TKey, TValue>: “One,1”, “Two,2”, ..., “Twelve,12”. The ‘entries’ array will look like this:

     

    The last five elements of the array are simply uninitialized Entry values added after the array auto-resized (you can’t see elements 15 and 16 in the screen capture).

     

    Here’s what our buckets array looks like:

     

     

    The screen capture doesn’t show element 0 and 1, but the value of bucket 0 is 5 (it “contains” 1 or more entries) and the value of bucket 1 is -1 (its empty). So nine of the buckets “contain” 1 or more entries, and 8 of the buckets are empty.

     

    So how do 9 buckets map to 12 entries?

     

    As it happens, on this execution (though this doesn’t remain constant), three of the buckets - buckets[0], buckets[8], and buckets[16] each map to two Entry values. buckets[0] maps to “Four” and “Six”, buckets[8] maps to “Eleven” and “Twelve”, and buckets[16] maps to “One” and “Two”.

     

    Here’s the FindEntry() method:

     

            private int FindEntry(TKey key)
            {
                if (key == null)
                {
                    throw new System.ArgumentNullException();
                }
                if (this.buckets != null)
                {
                    // get the key’s hash code.
                    int num = this.comparer.GetHashCode(key) & 0x7fffffff;
    
                    // modulus operation gives us the buckets array index, 
                    // using the entries hash code as the first operand. 
                    // This gets us to the right bucket. 
                    for (int i = this.buckets[num % this.buckets.Length]; i >= 0; i = this.entries[i].next)
                    {
                        if ((this.entries[i].hashCode == num) && this.comparer.Equals(this.entries[i].key, key))
                        {
                            return i;
                        }
                    }
                }
                return -1;
            }

    So if we search for “One”:

     

    -         “One” hashes to 1700053362.

    -         num % this.buckets.Length gives us 16.

    -         buckets[16] holds the value 1.

    -         entries[1].hashCode = 1700053362

     

    Then we get the value associated with the key using:

     

    int index = this.FindEntry(key);
    if (index >= 0)
    {
        return this.entries[index].value;
    }
    

     

    Notice how the first entry in each bucket is actually the head of a liked list (i = this.entries[i].next). If the entry is the only entry in the bucket, entries[i].next will be -1. If there are more than one entries in the bucket, entries[i].next will be a positive integer indicating that the bucket contains additional entries, and the loop will be entered again until a matching entry is found.

     

    I’m almost absolutely certain this is the collision resolution strategy known as "chaining". An entry is added to the dictionary. Its then mapped to the bucket its key hashes to. If that bucket already maps to another entry, the existing entry becomes the head of a linked list, and the new entry is prepended to the list.

     

    So that’s how dictionaries work their magic. Instead of having to loop through an entire list doing equality comparisons to locate an item, they only have to loop through the entries in the bucket that the entry hashes to. 

    That's as far as I've got so far. Still need to figure out the 'freelist' piece. Thanks very much for your reply. 
     

    Friday, February 05, 2010 4:29 AM
  • You just restated everything that I stated in my post.

    The differences are purely implementation details.

    1) The dictionary stores a specialized key/value structure that has two extra fields: HashCode and Next.
    2) The buckets are implemented with a linked list, the Next field is used for this implementation.
    3) The hash algorithm produces two values: A quick comparer and a bucket index.

    The HashCode value stored in the entry is used to make fast comparisons. If two items have different HashCodes, they can't be the same item, so this test is very fast. When comparing, FindEntry will perform a thorough comparison only if the search item and the bucket item have the same HashCode.

    The "freelist" is simply a block of pre-allocated structures which the Dictionary uses when adding new items. It allocates blocks of memory at a time, rather than one each time an item is added.

    All of these 'details' are just performance enhancements. Everything I said above is still true.
    • Marked as answer by Roahn Luo Tuesday, February 09, 2010 8:54 AM
    Friday, February 05, 2010 2:17 PM
  • I honestly do appreciate your post, and hope you don't feel I was suggesting any of it was incorrect.

    It's just that the implementation details are precisely what I was interested in, not abstractions.

    I did a deep, deep scan of the entire WWW over three days to try and find a single post that explains how a hash table might be implemented in C#. Aside from uncommented .NET source code, there is nothing. Lots of C++, no C#. System.Collections.Hashtable is pretty well commented on Rotor, but its implementation is very different from Dictionary<TKey, TValue> in fundamental ways: different hash function, different bucket behavior, different collision resolution strategy. So I'm more intimately familiar with how to speak about hash tables as abstractions than I would like to be. One can speak about the abstraction a "collection of buckets" ad nauseum, and still have no clear idea of what is being talked about, or how such a construct might actually be implemented in code. 

    Thank you for the 'freelist' pointer. I havn't stepped through this item carefully yet, and wasn't sure if it referred to free blocks in the entries array, or free slots in the buckets array. I'm still not, but feel you've pointed me in the right direction, and will invetigate it further when I get a chance.

    I do have to chuckle a bit though, and hope you can have a little laugh with me. I just finished an Eric Lippert blog, "References are not addresses" where he speaks the same way, using phrases like "but that is an implementation detail (with italics)". I don't know about you, but genuine comprehension occurs for me when I grasp the "implementation details". What is there about an abstraction to grasp??  Anyway, the blog is fascinating with 26 pages of equally fascinating comments from other developers. One of the more interesting threads running through the blog concerns the relative pedagogical value, when explaining programming constructs, of inaccurate oversimplifications vs. accurate digressions vs. vague hand-waves and the like. I've just been hunting for the accurate digression is all. But inaccurate oversimplifications and vague hand-waves have their place too.

    Thanks again,

    Bob
    Friday, February 05, 2010 5:21 PM
  • As an example of the importance of discussing implementation details in this context, I'll just offer this.

    The reason I couldn't get clear on whether a hash bucket contains 0 or 1 items, or 1 or Many items is because the answer is: it depends on the implementation details.

    System.Collections.Hashtable uses a double-hashing (re-hashing/probing) collision resolution strategy. So each bucket in System.Collections.Hashtable can contain 0 or 1 entries.

    System.Collections.Generic.Dictionary<TKey,TValue> uses a chaining collision resolution strategy. So each bucket in System.Collections.Generic.Dictionary<TKey,TValue> can contain 1 or Many entries (linked list).

    Friday, February 05, 2010 6:03 PM
  • Lots of C++, no C#
    C# is just a language, as is C++. What's important (from a comprehension point of view) is the mechanics of the implementation. All languages allocate memory and use that memory to store information. The grammer used to describe the memory and its elements are not important.

    If you have a solid understanding of C# (or any computer language), you should have little difficulty reading other languages well enough to grasp what they are doing to that memory. I'm not saying this to mock you, rather I would encourage you to read other languages and not shy away just because they are not languages you know well.

    If there is a derth of algorithm and data storage implementations published in C#, that's probably because it's a newer language. C/C++ has been around for almost 40 years now.


    Are you left with any specific questions about implementing a hash table or have you sorted it all out?
    Friday, February 05, 2010 7:12 PM
  • All sorted out I think.

    I suspect I'm going to find, when I step-through the code a bit again, that "freeList" just maintains a list of elements in the entries[] array from which entries have been Removed(). These elements can then be re-used as new entries are Add()[ed].  If I got that right, I think everything is sorted out.

    Thanks again for your time. Not an easy subject matter to sort, but this exchange has helped a lot.

    Best,

    Bob 
    Friday, February 05, 2010 7:39 PM