What does hash do in python?
TL;DR:
Please refer to the glossary: hash()
is used as a shortcut to comparing objects, an object is deemed hashable if it can be compared to other objects. that is why we use hash()
. It's also used to access dict
and set
elements which are implemented as resizable hash tables in CPython.
Technical considerations
- usually comparing objects (which may involve several levels of recursion) is expensive.
- preferably, the
hash()
function is an order of magnitude (or several) less expensive. - comparing two hashes is easier than comparing two objects, this is where the shortcut is.
If you read about how dictionaries are implemented, they use hash tables, which means deriving a key from an object is a corner stone for retrieving objects in dictionaries in O(1)
. That's however very dependent on your hash function to be collision-resistant. The worst case for getting an item in a dictionary is actually O(n)
.
On that note, mutable objects are usually not hashable. The hashable property means you can use an object as a key. If the hash value is used as a key and the contents of that same object change, then what should the hash function return? Is it the same key or a different one? It depends on how you define your hash function.
Learning by example:
Imagine we have this class:
>>> class Person(object):
... def __init__(self, name, ssn, address):
... self.name = name
... self.ssn = ssn
... self.address = address
... def __hash__(self):
... return hash(self.ssn)
... def __eq__(self, other):
... return self.ssn == other.ssn
...
Please note: this is all based on the assumption that the SSN never changes for an individual (don't even know where to actually verify that fact from authoritative source).
And we have Bob:
>>> bob = Person('bob', '1111-222-333', None)
Bob goes to see a judge to change his name:
>>> jim = Person('jim bo', '1111-222-333', 'sf bay area')
This is what we know:
>>> bob == jim
True
But these are two different objects with different memory allocated, just like two different records of the same person:
>>> bob is jim
False
Now comes the part where hash() is handy:
>>> dmv_appointments = {}
>>> dmv_appointments[bob] = 'tomorrow'
Guess what:
>>> dmv_appointments[jim] #?
'tomorrow'
From two different records you are able to access the same information. Now try this:
>>> dmv_appointments[hash(jim)]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 9, in __eq__
AttributeError: 'int' object has no attribute 'ssn'
>>> hash(jim) == hash(hash(jim))
True
What just happened? That's a collision. Because hash(jim) == hash(hash(jim))
which are both integers btw, we need to compare the input of __getitem__
with all items that collide. The builtin int
does not have an ssn
attribute so it trips.
>>> del Person.__eq__
>>> dmv_appointments[bob]
'tomorrow'
>>> dmv_appointments[jim]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: <__main__.Person object at 0x7f611bd37110>
In this last example, I show that even with a collision, the comparison is performed, the objects are no longer equal, which means it successfully raises a KeyError
.
A hash is an fixed sized integer that identifies a particular value. Each value needs to have its own hash, so for the same value you will get the same hash even if it's not the same object.
>>> hash("Look at me!")
4343814758193556824
>>> f = "Look at me!"
>>> hash(f)
4343814758193556824
Hash values need to be created in such a way that the resulting values are evenly distributed to reduce the number of hash collisions you get. Hash collisions are when two different values have the same hash. Therefore, relatively small changes often result in very different hashes.
>>> hash("Look at me!!")
6941904779894686356
These numbers are very useful, as they enable quick look-up of values in a large collection of values. Two examples of their use are Python's set
and dict
. In a list
, if you want to check if a value is in the list, with if x in values:
, Python needs to go through the whole list and compare x
with each value in the list values
. This can take a long time for a long list
. In a set
, Python keeps track of each hash, and when you type if x in values:
, Python will get the hash-value for x
, look that up in an internal structure and then only compare x
with the values that have the same hash as x
.
The same methodology is used for dictionary lookup. This makes lookup in set
and dict
very fast, while lookup in list
is slow. It also means you can have non-hashable objects in a list
, but not in a set
or as keys in a dict
. The typical example of non-hashable objects is any object that is mutable, meaning that you can change its value. If you have a mutable object it should not be hashable, as its hash then will change over its life-time, which would cause a lot of confusion, as an object could end up under the wrong hash value in a dictionary.
Note that the hash of a value only needs to be the same for one run of Python. In Python 3.3 they will in fact change for every new run of Python:
$ /opt/python33/bin/python3
Python 3.3.2 (default, Jun 17 2013, 17:49:21)
[GCC 4.6.3] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> hash("foo")
1849024199686380661
>>>
$ /opt/python33/bin/python3
Python 3.3.2 (default, Jun 17 2013, 17:49:21)
[GCC 4.6.3] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> hash("foo")
-7416743951976404299
This is to make is harder to guess what hash value a certain string will have, which is an important security feature for web applications etc.
Hash values should therefore not be stored permanently. If you need to use hash values in a permanent way you can take a look at the more "serious" types of hashes, cryptographic hash functions, that can be used for making verifiable checksums of files etc.
The Python docs for hash()
state:
Hash values are integers. They are used to quickly compare dictionary keys during a dictionary lookup.
Python dictionaries are implemented as hash tables. So any time you use a dictionary, hash()
is called on the keys that you pass in for assignment, or look-up.
Additionally, the docs for the dict
type state:
Values that are not hashable, that is, values containing lists, dictionaries or other mutable types (that are compared by value rather than by object identity) may not be used as keys.
The hash is used by dictionaries and sets to quickly look up the object. A good starting point is Wikipedia's article on hash tables.