How does the process of hashing work in Dictionary<TKey, TValue>

A hash table or dictionary is a data structure that stores key-value pairs. The advantage of the hash table is that given a key finding the corresponding value is pretty fast. Simplified, the time to find a key-value pair in the hash table does not depend on the size of the table. Compare that to storing the key-value pairs in a list or an array. To find a key-value pair you would have to search the list from the beginning until a matching key was found. The longer the list the more time it would take to find the key-value pair. Using big-O notation you can say that looking up a key in a hash table is of order O(1) while looking up a key in a list by using linear search is of order O(N) (simplified).

To insert a key-value pair in the hash table you will first have to compute the hash code of the key. In .NET all objects have a method named GetHashCode that returns a hash code (32 bit integer) for that particular object. It is important that equal objects return the same hash code, but also very useful if different objects return different hash codes. Beware of the misconception that different objects cannot return the same hash code - they can, but it will result in a collision (see below).

As an example consider the hash codes of two strings:

"Boo" 0x598FD95A
"Foo" 0x598FD8DE

Even though the strings are very similar they have different hash codes.

I am simplifying things a bit here to focus on the important aspects of a hash table so for now let us say that Internally Dictionary<TKey, TValue> stores the key-value pairs in an array. To locate the index in this array where the key-value pair will be stored you have to compute the hash code of the key modulo the size of the array. Assume the size of the array is 5:

Index("Boo") = 0x598FD95A % 5 = 4
Index("Foo") = 0x598FD8DE % 5 = 0

This leads to this internal hash table array:

+---+---------+
| 0 | "Foo"   |
+---+---------+
| 1 | (empty) |
+---+---------+
| 2 | (empty) |
+---+---------+
| 3 | (empty) |
+---+---------+
| 4 | "Boo"   |
+---+---------+

Looking up an entry in the hash table is very fast. You simply have to compute the hash code of the key modulo the size of the internal array and retrieve the string at that index.

Now consider the key "Zoo":

Index("Zoo") = 0x598FDC62 % 5 = 0

It has the same index as the key "Foo". This results in what is called a collision. A proper implementation of a hash table will have to handle collisions and there are different strategies for doing that. Also, as the internal array fills up there will be fewer and fewer empty elements in the array resulting in an increasing number of collisions. The load factor is the ratio between used elements and total elements in the internal array. In the example above the load factor is 2/5 = 0.4. Most hash table implementations will increase the size of the internal array when the load factor exceeds a certain threshold.

If you want to learn more about some of these concepts you will have to study some of the more comprehensive resources linked in other answers.


The hashing process in an Dictionary uses a technique that's refered to as chaining. With chaining, a secondary data structure is utilized to hold any collisions. Specifically, each slot in the Dictionary has an array of elements that map to a bucket. In the event of a collision, the colliding element is prepended to the bucket's list.

See this article on MSDN for more details.


By using a Computer Science concept called a Hash Map. This does work faster than searching a list. This works by keeping the search from needing to iterate through a list until it finds a match. Instead the key is "hashed", and used as an index into a list. This hashing function is almost always faster than searching the list (iterating with multiple comparisons).

Tags:

C#

.Net