How to combine hash codes in Python 3?
Instead of combining multiple strings together, use a tuple, as tuples are hashable in Python:
from typing import Tuple

t: Tuple[str, str, int] = ('Field1', 'Field2', 33)
print(hash(t))
This also makes the code easier to read.
The python documentation suggests that you use xor to combine hashes:
The only required property is that objects which compare equal have the same hash value; it is advised to somehow mix together (e.g. using exclusive or) the hash values for the components of the object that also play a part in comparison of objects.
I'd also recommend xor over addition and multiplication because of this:
Note: hash() truncates the value returned from an object’s custom __hash__() method to the size of a Py_ssize_t. This is typically 8 bytes on 64-bit builds and 4 bytes on 32-bit builds. If an object’s __hash__() must interoperate on builds of different bit sizes, be sure to check the width on all supported builds. An easy way to do this is with python -c "import sys; print(sys.hash_info.width)".
This documentation is the same for Python 2.7 and Python 3.4, by the way.
A note on symmetry and XOR-ing items with themselves.
As pointed out in the comments, XOR is symmetric, so the order of the operands is lost, and the XOR of two identical elements is zero. If that's not desired, mix in some rotations or shifts, or, even better, use this solution's suggestion of taking the hash of a tuple of the identifying elements. If you don't want to preserve order, consider using a frozenset instead.
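For instance, here is a sketch of the rotate-and-XOR idea mentioned above (the helper name combine_hashes and the rotation amount of 4 are my own choices, not from any documentation):

def combine_hashes(*hashes, width=64):
    """Order-sensitive XOR combine: rotate the accumulator before each XOR."""
    mask = (1 << width) - 1
    result = 0
    for h in hashes:
        # rotate the accumulator left by 4 bits, then mix in the next hash
        result = ((result << 4) | (result >> (width - 4))) & mask
        result ^= h & mask
    return result

# combine_hashes(hash('a'), hash('b')) != combine_hashes(hash('b'), hash('a'))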
The easiest way to produce good hashes is to put your values in a standard hashable Python container, then hash that. This includes combining hashes in subclasses. I'll explain why, and then how.
Base requirements
First things first:
- If two objects test as equal, then they MUST have the same hash value
- Objects that have a hash MUST produce the same hash over time.
Only when you follow those two rules can your objects safely be used in dictionaries and sets. A stable hash is what keeps dictionaries and sets from breaking: they use the hash to pick a storage location, and if the hash changes they can no longer locate the object given another object that tests equal.
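To see why a changing hash breaks lookups, here is a deliberately broken sketch (BrokenKey is a hypothetical class whose hash follows a mutable attribute):

class BrokenKey:
    def __init__(self, value):
        self.value = value
    def __eq__(self, other):
        return isinstance(other, BrokenKey) and self.value == other.value
    def __hash__(self):
        return hash(self.value)  # changes when self.value changes: don't do this

key = BrokenKey(1)
seen = {key}
key.value = 2          # the set filed the object under its old hash
print(key in seen)     # usually False, even though the object is "in" the set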
Note that it doesn’t even matter if the two objects are of different types; True == 1 == 1.0, so all of them have the same hash and will all count as the same key in a dictionary.
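You can check this directly in an interpreter session:

>>> hash(True) == hash(1) == hash(1.0)
True
>>> {True: 'bool', 1: 'int', 1.0: 'float'}
{True: 'float'}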
What makes a good hash value
You'd want to combine the components of your object value in ways that will produce, as much as possible, different hashes for different values. That includes things like ordering and specific meaning, so that two attributes that represent different aspects of your value, but that can hold the same type of Python objects, still result in different hashes, most of the time.
Note that it's fine if two objects that represent different values (won't test equal) have equal hashes. Reusing a hash value won't break sets or dictionaries. However, if a lot of different object values produce equal hashes then that reduces their efficiency, as you increase the likelihood of collisions. Collisions require collision resolution, and collision resolution takes more time, so much so that denial of service attacks can be mounted against servers with predictable hashing implementations (*).
So you want a nice wide spread of possible hash values.
Pitfalls to watch out for
The documentation for the object.__hash__ method includes some advice on how to combine values:
The only required property is that objects which compare equal have the same hash value; it is advised to somehow mix together (e.g. using exclusive or) the hash values for the components of the object that also play a part in comparison of objects.
but only using XOR will not produce good hash values, not when the values whose hashes you XOR together can be of the same type but have different meaning depending on the attribute they've been assigned to. To illustrate with an example:
>>> class Foo:
...     def __init__(self, a, b):
...         self.a = a
...         self.b = b
...     def __hash__(self):
...         return hash(self.a) ^ hash(self.b)
...
>>> hash(Foo(42, 'spam')) == hash(Foo('spam', 42))
True
Because the hashes for self.a and self.b were just XOR-ed together, we got the same hash value for either order, effectively halving the number of usable hashes. Do so with more attributes and you cut the number of unique hashes down rapidly. So you may want to include a bit more information in the hash about each attribute, if the same values can be used in different elements that make up the hash.
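One way of doing that (a sketch of my own, not the tuple approach the answer recommends next) is to tag each component's hash with the attribute it came from before XOR-ing:

def __hash__(self):
    # Pairing each value with its attribute name means swapping
    # self.a and self.b no longer yields the same combined hash.
    return hash(('a', self.a)) ^ hash(('b', self.b))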
Next, know that while Python integers are unbounded, hash values are not. That is to say, hash values have a finite range. From the same documentation:
Note: hash() truncates the value returned from an object’s custom __hash__() method to the size of a Py_ssize_t. This is typically 8 bytes on 64-bit builds and 4 bytes on 32-bit builds.
This means that if you use addition, multiplication, or other operations that increase the number of bits needed to store the hash value, you will end up losing the upper bits and so reduce the number of different hash values again.
Next, if you combine multiple hashes with XOR that already have a limited range, chances are you end up with an even smaller number of possible hashes. Try XOR-ing the hashes of 1000 random integers in the range 0-10, for an extreme example.
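A quick sketch of that experiment:

import random
from functools import reduce

# In CPython, hash(n) == n for small non-negative ints, so XOR-ing hashes of
# values drawn from range(11) can only ever produce a result in range(16):
# no input has a bit set above bit 3.
samples = [random.randint(0, 10) for _ in range(1000)]
combined = reduce(lambda acc, value: acc ^ hash(value), samples, 0)
print(combined)  # always somewhere in 0..15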
Hashing, the easy way
Python developers have long since wrestled with the above pitfalls, and solved them for the standard library types. Use this to your advantage. Put your values in a tuple, then hash that tuple.
Python tuples use a simplified version of the xxHash algorithm to capture order information and to ensure a broad range of hash values. So for different attributes, you can capture the different meanings by giving them different positions in a tuple, then hashing the tuple:
def __hash__(self):
    return hash((self.a, self.b))
This ensures you get unique hash values for unique orderings.
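Revisiting the earlier Foo example with a tuple-based __hash__ shows the difference (a sketch; hash collisions are still possible in principle, just no longer guaranteed by symmetry):

>>> class Foo:
...     def __init__(self, a, b):
...         self.a = a
...         self.b = b
...     def __hash__(self):
...         return hash((self.a, self.b))
...
>>> hash(Foo(42, 'spam')) == hash(Foo('spam', 42))
False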
If you are subclassing something, put the hash of the parent implementation into one of the tuple positions:
def __hash__(self):
    return hash((super().__hash__(), self.__more_data))
Hashing a hash value does reduce it to a 60-bit or 30-bit value (on 64-bit or 32-bit platforms, respectively), but that's not a big problem when combined with other values in a tuple. If you are really concerned about this, put None in the tuple as a placeholder and XOR the parent hash (so super().__hash__() ^ hash((None, self.__more_data))). But this is overkill, really.
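As a self-contained sketch of that pattern (Point and Point3D are hypothetical names; the matching __eq__ methods are omitted for brevity):

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __hash__(self):
        return hash((self.x, self.y))

class Point3D(Point):
    def __init__(self, x, y, z):
        super().__init__(x, y)
        self.z = z
    def __hash__(self):
        # fold the parent's hash into one tuple position, as described above
        return hash((super().__hash__(), self.z))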
If you have multiple values whose relative order doesn't matter, and don't want to XOR these all together one by one, consider using a frozenset() object for fast processing, combined with a collections.Counter() object if values are not meant to be unique. The frozenset() hash operation accounts for small hash ranges by reshuffling the bits in the hashes first:
# unordered collection hashing
from collections import Counter
hash(frozenset(Counter(...).items()))
As always, all values in the tuple or frozenset() must be hashable themselves.
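A sketch of an unordered, duplicate-aware container built on that idea (Basket is a hypothetical example class):

from collections import Counter

class Basket:
    """Equality ignores order but respects duplicate counts."""
    def __init__(self, items):
        self.items = tuple(items)
    def __eq__(self, other):
        return isinstance(other, Basket) and Counter(self.items) == Counter(other.items)
    def __hash__(self):
        # frozenset of (item, count) pairs: order-insensitive, multiplicity-aware
        return hash(frozenset(Counter(self.items).items()))

# Basket(['a', 'b', 'a']) == Basket(['b', 'a', 'a'])  -> True, and their hashes match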
Consider using dataclasses
For most objects you write __hash__ functions for, you actually want to be using a dataclass-generated class:
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Foo:
    a: Union[int, str]
    b: Union[int, str]
Dataclasses are given a sane __hash__ implementation when frozen=True or unsafe_hash=True, using a tuple() of all the field values.
(*) Python protects your code against such hash collision attacks by using a process-wide random hash seed to hash strings, bytes and datetime objects.
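You can observe the randomization by running the same one-liner twice in separate interpreter runs (assuming PYTHONHASHSEED isn't pinned in your environment):

# the string hash differs per interpreter run; in CPython, hash(42) == 42 every time
print(hash('spam'), hash(42))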