.NET Dictionary, impressively fast but how does it work?
The Dictionary<TKey, TValue> in .NET is a data structure called a hash table:
On Hash Table and .Net Dictionary:
http://en.wikipedia.org/wiki/Hash_table
http://msdn.microsoft.com/en-us/library/4yh14awz.aspx
http://www.cs.auckland.ac.nz/~jmor159/PLDS210/hash_tables.html
On Binary Search:
http://en.wikipedia.org/wiki/Binary_search
You're right, it uses more memory than an array to retrieve data. That's the trade-off you pay for faster access. (This is true in most cases; once you take into account the setup time to build a hash table versus an array, a sorted array can at times be faster for setup and access. In general, though, this is a valid assumption.)
Not so long ago I swore on my mother's grave to bring a detailed answer to this question. It took me quite a while, since some of the details and concepts were a bit rusty on my end, but anyhoo, without any further ado, here it goes:
How the .NET Dictionary works in length, or kind of...
First things first, let's start off with the concept. As so many other answers have already pointed out, the Dictionary<TKey, TValue> is a generic (in the sense of the C# language feature) implementation of a hash table.
A hash table is simply an associative array. When you pass in a (key, value) pair, the key is used to compute a hash code, which in turn is used to compute the location of a memory slot (called a bucket) in an underlying storage array (called... buckets), in which the pair you just passed, plus some additional information, will be saved. This is usually achieved by computing the modulo (%) of the hash code by the size of the buckets array: hashCode % buckets.Length.
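As a minimal sketch of that mapping (the helper name and the bucket count are invented for illustration; this is not the BCL code):
// A toy sketch of mapping a key to a bucket index.
static int GetBucketIndex(string key, int bucketCount)
{
    // Mask off the sign bit so the index is never negative,
    // just like the real Dictionary does (see Step 2 below).
    var hashCode = key.GetHashCode() & 0x7FFFFFFF;
    return hashCode % bucketCount;
}

var indexA = GetBucketIndex("apple", 7); // some slot in [0, 6]
var indexB = GetBucketIndex("pear", 7);  // possibly the same slot: a collision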
This sort of associative array has an average complexity of O(1) (i.e. constant time) for search, insertion and deletion... except under certain circumstances that we will dig into later on. So, generally speaking, it's much faster to look something up in a dictionary than in, say, a list or an array, since you normally don't have to iterate through all the values.
If you have paid attention to what I have said so far, you will have noticed that there might already be an issue. What if the hash code computed from our key is exactly the same as the one from another key? Or worse, a bunch of other keys? That basically means we could end up at the same location. How do we manage those collisions? Well, obviously, very smart folks gave thought to this particular problem decades ago and came up with essentially two main ways of resolving collisions:
- Separate Chaining: the pairs are stored in a different storage array than the buckets (often called entries). For each bucket (each index computed) we keep a chain of entries holding the different values that landed at the same "index" (due to the same hash code); in case of collisions you have to iterate through those keys (and find another way, other than the hash code, to distinguish them).
- Open Addressing: everything is stored in the buckets themselves, and when the first bucket found is occupied we probe for another one. Different probing schemes exist (Linear Probing, Quadratic Probing, Double Hashing, etc.).
The implementation of either collision resolution strategy can vary a great deal. In the case of the .NET Dictionary, the data structure relies on Separate Chaining, as we will see in a few minutes.
OK, now let's look at how things are inserted into the .NET Dictionary<TKey, TValue>, which boils down to walking through the code of the method below:
private void Insert(TKey key, TValue value, bool add)
Note: after reading the insertion steps below, you can figure out the rationale behind the deletion and lookup operations by inspecting the code linked in the sources at the bottom of my answer.
Step 1: Give me the hash code
There are two ways the hash code of the TKey key can be computed:
- One relies on the default IEqualityComparer<TKey> implementation if you don't pass any as a parameter of the Dictionary<TKey, TValue> constructor, which is basically generated by EqualityComparer<TKey>.Default (implementation available here). If TKey is a type different from all the usual stuff (like primitives and string), i.e. a custom type, the IEqualityComparer<in TKey> will leverage the implementation (including the overrides) of:
  - bool Equals(object obj)
  - int GetHashCode()
- The other relies on the implementation of IEqualityComparer<in TKey> you can pass to the Dictionary<TKey, TValue> constructor.
The interface IEqualityComparer<in T> looks like this:
// The generic IEqualityComparer interface implements methods to check if two objects are equal
// and to generate a hash code for an object.
// It is used in the Dictionary class.
public interface IEqualityComparer<in T>
{
bool Equals(T x, T y);
int GetHashCode(T obj);
}
Either way, the dictionary ends up with an initial hash code obtained through the comparer: comparer.GetHashCode(key)
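To make Step 1 concrete, here is a minimal sketch of a comparer you could pass to the constructor (the CaseInsensitiveStringComparer name is made up for illustration; the BCL already ships StringComparer.OrdinalIgnoreCase for this exact case):
using System;
using System.Collections.Generic;

var dict = new Dictionary<string, int>(new CaseInsensitiveStringComparer());
dict["key"] = 1;
Console.WriteLine(dict["KEY"]); // prints 1: both spellings hash and compare as equal

// Equals and GetHashCode must agree: keys that are "equal" must
// also produce the same hash code, or lookups will miss.
class CaseInsensitiveStringComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y) =>
        string.Equals(x, y, StringComparison.OrdinalIgnoreCase);

    public int GetHashCode(string obj) =>
        obj.ToUpperInvariant().GetHashCode();
}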
Step 2: Get the target bucket
The hash code we got from our TKey key through the IEqualityComparer<in T> might sometimes be negative, which is not really helpful if we want a positive index into an array...
What happens is that, in order to get rid of negative values, the Int32 hash code obtained from comparer.GetHashCode() is "ANDed" (in the sense of boolean logic on bits) with Int32.MaxValue (i.e. 2147483647, or 0x7FFFFFFF):
var hashCode = comparer.GetHashCode(key) & 0x7FFFFFFF;
The target bucket (the index) is obtained as follows:
var targetBucket = hashCode % buckets.Length
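A quick worked example of that masking (illustrative values only):
var masked = -1 & 0x7FFFFFFF; // -1 is 0xFFFFFFFF in two's complement; masking clears the sign bit -> 2147483647
var bucket = masked % 7;      // 2147483647 % 7 == 1, a valid index into a 7-slot buckets array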
We will also see in a moment how the buckets array is resized.
The buckets field (an int[]) is a private field of the Dictionary<TKey, TValue> containing, for each bucket, the index of the first related slot in the entries field, which is an Entry[], with Entry being defined as follows:
private struct Entry
{
public int hashCode;
public int next;
public TKey key;
public TValue value;
}
The key, value and hashCode fields are self-explanatory. Regarding the next field: it indicates the index of the next entry in the chain when several keys share the same bucket (i.e. a collision); if that entry is the last item of a chain, the next field is set to -1.
Note: the hashCode field in the Entry struct is the one after the negative-value adjustment.
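To illustrate how buckets, entries and next fit together, here is a hypothetical state after inserting keys A, B and C, where A and C happen to collide on bucket 2 (all indexes and values invented):
// buckets: [-1, -1, 2, -1, -1, 1, -1]  // bucket 2 -> entry 2, bucket 5 -> entry 1
//
// entries:
//   [0] { hashCode: hA, next: -1, key: A, value: ... }  // end of bucket 2's chain
//   [1] { hashCode: hB, next: -1, key: B, value: ... }  // alone in bucket 5
//   [2] { hashCode: hC, next:  0, key: C, value: ... }  // collided with A, chains to entry 0
//
// Looking up A: buckets[2] -> entry 2 (key C, no match) -> next = 0 -> entry 0 (match).
Note that the most recently inserted entry becomes the head of its chain, as Step 5 below shows.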
Step 3: check if there is already an entry
At this stage it is important to note that the behaviour differs depending on whether you are updating (add = false) or strictly inserting (add = true) a new value.
We will now check the entries related to the targetBucket, starting with the first entry, which is given by:
var entryIndex = buckets[targetBucket];
var firstEntry = entries[entryIndex];
The actual (simplified) source code with comments:
// Iterating through all the entries related to the targetBucket
for (var i = buckets[targetBucket]; i >= 0; i = entries[i].next)
{
// Check whether this entry's (adjusted) hash code and key both match
if (entries[i].hashCode == hashCode &&
comparer.Equals(entries[i].key, key))
{
// Strict insertion (add == true): duplicate keys are not allowed
if (add)
{
// Argument Exception:
// "Item with Same Key has already been added" thrown =]
ThrowHelper.ThrowArgumentException(ExceptionResource.Argument_AddingDuplicate);
}
// We update the entry value
entries[i].value = value;
// Modification while iterating check field
version++;
return;
}
}
Note: the version field is also used in other common .NET data structures (e.g. List<T>); it helps detect modifications while iterating (on MoveNext()) and throw the related exception.
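You can see the add flag from the public surface: Add takes the strict-insert path (add = true), while the indexer takes the update path (add = false). A quick sketch:
using System;
using System.Collections.Generic;

var dict = new Dictionary<string, int>();

dict.Add("answer", 41);   // strict insert (add = true)
dict["answer"] = 42;      // indexer goes through the update path (add = false)

try
{
    dict.Add("answer", 43); // same key on the strict path again
}
catch (ArgumentException e)
{
    Console.WriteLine(e.Message); // "An item with the same key has already been added..."
}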
Step 4: check if the arrays need to be resized
// The entries location in which the data will be inserted
var index = 0;
// The freeCount field indicates the number of holes / empty slots available for insertion.
// Those available slots are the results of prior removal operations
if (freeCount > 0)
{
// The freeList field points to the first hole (ie. available slot) in the entries
index = freeList;
freeList = entries[index].next;
// The hole is no longer available
freeCount--;
}
else
{
// The entries array is full
// Need to resize it to make it bigger
if (count == entries.Length)
{
Resize();
targetBucket = hashCode % buckets.Length;
}
index = count;
count++;
}
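For example (a hypothetical sequence): add three entries, then remove the second one; the slot at index 1 becomes a hole, freeList is set to 1 and freeCount to 1, so the very next insertion reuses slot 1 instead of growing count.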
Note about the Resize() call:
- It roughly doubles the size of the underlying arrays (buckets and entries), like many other collections, in order to avoid too many resize operations by providing ample space whenever the collection is resized.
- Roughly doubles ("double-ish") because the new size can only be a prime number from a precomputed set of Int32 values, e.g. 3, 7, 11, 17, ..., 7199369, since using a prime can reduce the number of collisions (see that answer on SO).
Actually, early in the Resize() method, the new size is computed as follows:
public static int ExpandPrime(int oldSize)
{
var min = 2 * oldSize;
if ((uint) min > 2146435069U && 2146435069 > oldSize)
{
return 2146435069;
}
return HashHelpers.GetPrime(min);
}
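For instance, assuming the precomputed primes table from the reference source (3, 7, 11, 17, 23, 29, 37, ...): growing from a size of 7 gives min = 14, and HashHelpers.GetPrime(14) returns 17, the smallest prime in the table that is at least 14, hence the "double-ish" growth (7 -> 17 -> 37 -> ...). The 2146435069 constant caps the size near the largest array length the runtime supports.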
Step 5: Add the entry
Since the dictionary is done checking for holes and sizing, it can finally add the entry using the computed hashCode, key, value and the index that has just been calculated, and adjust the target bucket accordingly:
entries[index].hashCode = hashCode;
// If the bucket already contained an item, it will be the next in the collision resolution chain.
entries[index].next = buckets[targetBucket];
entries[index].key = key;
entries[index].value = value;
// The bucket will point to this entry from now on.
buckets[targetBucket] = index;
// Again, modification while iterating check field
version++;
Bonus: string special treatment
Quoted from the CodeProject source listed below:
In order to make sure that each 'get' and 'add' operation will not go over more than 100 items for each bucket, a collision counter is being used.
If while traversing the array to find or add an item the collision counter goes over 100 (the limit is hard-coded) and the IEqualityComparer is of type EqualityComparer<string>.Default, a new IEqualityComparer<string> instance is generated for an alternative string hashing algorithm. If such a provider is found, the dictionary will allocate new arrays and copy the content into the new arrays using the new hash code and equality provider.
This optimization might be useful in a scenario where your string keys are somehow not distributed evenly, but it could also lead to massive allocations and wasted CPU time generating new hash codes for what could be a lot of items in the dictionary.
Usage
Whenever you use a custom type as a key, don't forget to implement the IEqualityComparer interface or to override the two Object methods (GetHashCode + Equals), in order not to shoot yourself in the foot upon insertion.
Not only will you avoid some nasty surprises, but you can also control the distribution of the items you insert. By having evenly distributed hash codes you can avoid chaining too many items and thus wasting time iterating over the related entries.
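As a sketch of the override route (the Point type here is hypothetical, just for illustration):
using System;

// A hypothetical custom key type overriding both Object methods.
// Equals and GetHashCode must agree: equal points produce equal hash codes.
public class Point
{
    public int X { get; }
    public int Y { get; }

    public Point(int x, int y) { X = x; Y = y; }

    public override bool Equals(object obj) =>
        obj is Point other && X == other.X && Y == other.Y;

    // HashCode.Combine (available on modern .NET) mixes both coordinates
    // so hash codes stay evenly distributed.
    public override int GetHashCode() => HashCode.Combine(X, Y);
}
Without these overrides, reference equality kicks in, so two distinct Point instances with the same coordinates would be treated as different keys.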
Side note for interviewees/ers
I would like to put emphasis on the fact that knowing those implementation details is usually not a big deal for an interview (the actual implementation differs between versions of .NET ("regular" or Core...) and might still be subject to change at a later point in time).
If someone were to ask me the question, I would say:
- The answer you're looking for is on StackOverflow :)
- The answer you're looking for is also available on either:
- .NET Reference
- Github - .NET Core
- The answer you're looking for does not need implementation details and the official documentation here will suffice for most use cases.
Unless, unless... you are supposed to implement hash tables slash dictionaries all by yourself in your day-to-day job, in which case that sort of knowledge (i.e. implementation details) may come in handy or even be mandatory.
Sources:
- .NET Reference - Source Code
- Github - .NET Core - Source Code
- Reddit - Eli5: how double hashing works in .NET dictionaries?
- SO - How double hashing works in case of the .NET Dictionary?
- Code Project - Understanding Generic Dictionary in-depth
- Mark Vincze's coding blog - Back to basics: Dictionary part 1, hash tables
- Mark Vincze's coding blog - Back to basics: Dictionary part 2, .NET implementation
The basic principle is:
- Set up an empty array.
- Obtain the hash code of the key.
- Re-hash the hash code to fit the size of the array (e.g. if the array is 31 items in size, we can do hash % 31) and use this as an index.
Retrieval is then a matter of finding the index in the same way, obtaining the key if it's there, and calling Equals on that item.
The obvious issue here is what to do if there are two items that belong at the same index. One approach is to store a list (or similar) in the array rather than the key-value pair itself; another is "reprobing" into a different index. Both approaches have advantages and disadvantages, and, as described above, the Dictionary implementation chains entries in a list.
Above a certain size, the stored chains (or the amount of reprobing, if you took that approach) get too long and the near-O(1) behaviour is lost, at which point the table is resized so as to improve this.
Clearly though, a really poor hash algorithm can destroy this; you can demonstrate it to yourself by building a dictionary of objects whose GetHashCode method is the following:
public override int GetHashCode()
{
return 0;
}
This is valid, but horrible, and turns your near-O(1) behaviour into O(n) (and bad even as O(n) goes).
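To see this in action, here is a sketch using such a pathological key (the BadKey type is invented for the demonstration):
using System;
using System.Collections.Generic;

var dict = new Dictionary<BadKey, string>();
for (var i = 0; i < 10_000; i++)
    dict[new BadKey(i)] = i.ToString(); // every insert scans the whole chain: O(n^2) total

// Every instance hashes to 0, so all entries pile into one bucket's chain,
// and each lookup degrades to a linear scan over every key via Equals.
class BadKey
{
    public int Id { get; }
    public BadKey(int id) { Id = id; }

    public override bool Equals(object obj) => obj is BadKey other && Id == other.Id;
    public override int GetHashCode() => 0;
}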
There are plenty of other details and optimisations, but the above is the basic principle.
Edit:
Incidentally, if you have a perfect hash (you know all possible values and have a hash method that gives each such value a unique hash in a small range), it's possible to avoid the issues of reprobing or chaining that occur with more general-purpose hash tables, and just treat the hash as an index into an array. This gives both O(1) behaviour and minimal memory use, but it only applies when all possible values are in a small range.
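As a small illustration of that idea, a plain array indexed by a compact enum behaves like a perfect-hash table (the opening-hours scenario is invented):
using System;

// DayOfWeek values run from Sunday (0) to Saturday (6): a small, known range.
// Indexing an array by the enum value gives a collision-free "hash table".
var openingHours = new string[7];
openingHours[(int)DayOfWeek.Monday] = "9-5";
openingHours[(int)DayOfWeek.Saturday] = "closed";

Console.WriteLine(openingHours[(int)DayOfWeek.Monday]); // 9-5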