Python's underlying hash data structure for dictionaries

The only way to be sure would be to implement both and check, but my informed guess is that the dictionary will be faster, because a binary search tree has cost O(log(n)) for lookup and insertion, and I think that except under the most pessimal of situations (such as massive hash collisions) the hash table's O(1) lookup will outweigh the occasional resize.

If you take a look at the Python dictionary implementation, you'll see that:

a dictionary starts out with 8 entries (PyDict_MINSIZE);
a dictionary with 50,000 or fewer entries quadruples in size when it grows;
a dictionary with more than 50,000 entries doubles in size when it grows;
key hashes are cached in the dictionary, so they are not recomputed when the dictionary is resized.

(The "NOTES ON OPTIMIZING DICTIONARIES" are worth reading too.)

So if your dictionary has 1,000,000 entries, I believe that it will be resized eleven times (8 → 32 → 128 → 512 → 2048 → 8192 → 32768 → 131072 → 262144 → 524288 → 1048576 → 2097152) at a cost of 2,009,768 extra insertions during the resizes. This seems likely to be much less than the cost of all the rebalancing involved in 1,000,000 insertions into an AVL tree.

What's the ratio of items vs unique items? What's the expected number of unique items?

If a hash bucket fills, then extending should just be a matter of some memory reallocation, not rehashing.

Testing a counting dict should be very quick and easy to do.

Note also the counter class available since python 2.7 http://docs.python.org/library/collections.html#counter-objects http://svn.python.org/view?view=rev&revision=68559

Python dictionaries are highly optimized. Python makes various special-case optimizations that the Python developers cater for in the CPython dictionary implementation.

In CPython, all PyDictObject's are optimized for dictionaries containing only string keys.
Python's dictionary makes an effort to never be more than 2/3rds full.

The book "Beautiful Code" discusses this all.

The eighteenth chapter is Python's Dictionary Implementation: Being All Things to All People by Adrew Kuchling

It is much better to use it than try to achieve the hand crafted custom implementation which will have to replicate all these optimizations to be any where near the main CPython implementation of dictionary look ups.

Python

Performance

Algorithm

Data Structures

Python's underlying hash data structure for dictionaries

Tags:

Related

Recent Posts