Why is XOR the default way to combine hashes?
In spite of its handy bit-mixing properties, XOR is not a good way to combine hashes due to its commutativity. Consider what would happen if you stored the permutations of {1, 2, …, 10} in a hash table of 10-tuples.
A much better choice is m * H(A) + H(B)
, where m is a large odd number.
Credit: The above combiner was a tip from Bob Jenkins.
xor
is a dangerous default function to use when hashing. It is better than and
and or
, but that doesn't say much.
xor
is symmetric, so the order of the elements is lost. So "bad"
will hash combine the same as "dab"
.
xor
maps pairwise identical values to zero, and you should avoid mapping "common" values to zero:
So (a,a)
gets mapped to 0, and (b,b)
also gets mapped to 0. As such pairs are almost always more common than randomness might imply, you end up with far to many collisions at zero than you should.
With these two problems, xor
ends up being a hash combiner that looks half decent on the surface, but not after further inspection.
On modern hardware, adding usually about as fast as xor
(it probably uses more power to pull this off, admittedly). Adding's truth table is similar to xor
on the bit in question, but it also sends a bit to the next bit over when both values are 1. This means it erases less information.
So hash(a) + hash(b)
is better than hash(a) xor hash(b)
in that if a==b
, the result is hash(a)<<1
instead of 0.
This remains symmetric; so the "bad"
and "dab"
getting the same result remains a problem. We can break this symmetry for a modest cost:
hash(a)<<1 + hash(a) + hash(b)
aka hash(a)*3 + hash(b)
. (calculating hash(a)
once and storing is advised if you use the shift solution). Any odd constant instead of 3
will bijectively map a "k
-bit" unsigned integer to itself, as map on unsigned integers is math modulo 2^k
for some k
, and any odd constant is relatively prime to 2^k
.
For an even fancier version, we can examine boost::hash_combine
, which is effectively:
size_t hash_combine( size_t lhs, size_t rhs ) {
lhs ^= rhs + 0x9e3779b9 + (lhs << 6) + (lhs >> 2);
return lhs;
}
here we add together some shifted versions of lhs
with a constant (which is basically random 0
s and 1
s – in particular it is the inverse of the golden ratio as a 32 bit fixed point fraction) with some addition and an xor. This breaks symmetry, and introduces some "noise" if the incoming hashed values are poor (ie, imagine every component hashes to 0 – the above handles it well, generating a smear of 1
and 0
s after each combine. My naive 3*hash(a)+hash(b)
simply outputs a 0
in that case).
(For those not familiar with C/C++, a size_t
is an unsigned integer value which is big enough to describe the size of any object in memory. On a 64 bit system, it is usually a 64 bit unsigned integer. On a 32 bit system, a 32 bit unsigned integer.)
Assuming uniformly random (1-bit) inputs, the AND function output probability distribution is 75% 0
and 25% 1
. Conversely, OR is 25% 0
and 75% 1
.
The XOR function is 50% 0
and 50% 1
, therefore it is good for combining uniform probability distributions.
This can be seen by writing out truth tables:
a | b | a AND b
---+---+--------
0 | 0 | 0
0 | 1 | 0
1 | 0 | 0
1 | 1 | 1
a | b | a OR b
---+---+--------
0 | 0 | 0
0 | 1 | 1
1 | 0 | 1
1 | 1 | 1
a | b | a XOR b
---+---+--------
0 | 0 | 0
0 | 1 | 1
1 | 0 | 1
1 | 1 | 0
Exercise: How many logical functions of two 1-bit inputs a
and b
have this uniform output distribution? Why is XOR the most suitable for the purpose stated in your question?
Xor may be the "default" way to combine hashes but Greg Hewgill's answer also shows why it has its pitfalls: The xor of two identical hash values is zero. In real life, there are identical hashes are more common than one might have expected. You might then find that in these (not so infrequent) corner cases, the resulting combined hashes are always the same (zero). Hash collisions would be much, much more frequent than you expect.
In a contrived example, you might be combining hashed passwords of users from different websites you manage. Unfortunately, a large number of users reuse their passwords, and a surprising proportion of the resulting hashes are zero!