What is the benefit of having a cryptographically secure hash algorithm in hashmaps?
Sometimes applications use untrusted data as the key in a hash map. A simple implementation can allow the untrusted data to cause a denial of service attack.
Hash maps are fast - O(1) - in the best case, but slow - O(n) - in the worst case. This is because keys are normally in separate buckets, but some values can result in the same hash - a collision - which are handled by a slower linked list. With random data, collisions will be uncommon. However, some implementation have a vulnerability where malicious data can cause many collisions, which makes the hash map slow. Some years ago there was a Linux kernel DoS due to this.
The root cause of the Linux vulnerability was that hashing was predictable. It was fixed by introducing a key into the hash function that a remote user would not know. I don't know exactly how Rust hash maps work, but I expect they use a similar kind of keyed hash.
You should opt for a more secure hash any time you're using untrusted data as a key.
Insert, search, and remove operations on hash tables have O(n) worst case behavior. If an attacker can choose keys to be inserted into a hash table and they can compute the hash function themselves then that creates an opportunity for denial of service. All they need to do is choose keys that map to the same bucket.
The quote suggests that the use of a cryptographic hash algorithm (SHA, MD5, Blake, Skein, etc.) solves the problem. That interpretation is totally incorrect. The algorithm that Rust's HashMap uses is called SipHash. It is a hash algorithm. And it is a cryptographic algorithm. But it is not a cryptographic-hash-function. The correct term for SipHash in the world of cryptography is PRF.
The key difference is that (in cryptography) all the details of a hash function may be public knowledge. A PRF, on the other hand, requires a secret key. Without the secret information there is no way to anticipate, for any input, what the output will be. (All the other details are public.)
Something like SHA-2 will not prevent denial of service. It will be totally unbiased for non-adversarial inputs. (Because cryptographic hash functions can be modelled as random oracles.) However anyone can evaluate plain SHA-2, so someone can find hash table collisions by brute force.
The fact that a cryptographic hash function is collision resistant (with at least 256-bit output) doesn't translate to a lack of collisions in the case of hash tables. Ultimately your hash function, for a table with n buckets, will be reduced to one of n possible values. By trial and error you can find an input that maps to a specific bucket about once every n tries. No hash tables use enough buckets to make this impossible.
Using an un-keyed hash function is inherently vulnerable to denial of service, no matter how good the hash function is. The fact that the attacker and server-with-a-hash-map both query the same oracle enables a DOSer to use specifically chosen inputs to tie up your CPU.
PRFs like SipHash don't have this vulnerability if used correctly. The server uses an oracle/function chosen from a pool of 2128 possible functions. To exploit a PRF based (hash-table-)hash-function the attacker either needs to guess which of the 2128 functions he should be using (a "key recovery") or to find a bias in the PRF independent of the key (a way to distinguish the PRF from a random oracle).
Finally, there exists more confusing nuances involving hash algorithms. But summarized simply:
- Cryptographic hash functions are a subset of all ordinary hash functions
- Under the classical definition of cryptographic hash function the randomness isn't required. However randomness is a feature of all big-name cryptographic hash functions anyway.
- Not all PRFs are cryptographic hash functions
- Not all cryptographic hash functions are PRFs
- An algorithm can have the properties of a PRF and a cryptographic hash function.
- Blake2, Skein, and KMAC have both sets of properties
- The SHA-2 and SHA-3 families are examples of (unkeyed) cryptographic hash functions
- SipHash is only a PRF (and an ordinary, but not cryptographic, hash function)
- A PRF can be constructed using typical cryptographic hash functions, but the hash function itself is not necessarily a PRF.
- "Randomized hashing" and "universal hashing" are similar to PRFs in some ways, but they don't have the same security requirements.
- Collisions can be created even if MurmurHash3's seed is randomized
I agree that's a little vague and it's going to heavily depend on how the hashmaps are used.
Here's my guess: say you're taking some input from users, say [Firstname.Lastname]
and using that as the lookup value in your hash table. Let's say you're building your hash table using the simple hash function that takes initials so that [Firstname.Lastname] --> FL
then it would be easy for an attacker to submit loads of values that all hash to the same thing. That would essentially turn your hashtable into a list which negates all the performance gains of using a hashtable. Slow lookups = denial of service.
AA -> [ ]
AB -> [ ]
...
FK -> [ ]
FL -> [First.Last, F1.F2, F1.F2, Fanotheu.Lonteuh, ...]
FM -> [ ]
...
ZZ -> [ ]
Cryptographic hash functions are specifically designed to prevent this because it's very hard to construct two different inputs that have the same hash value (called collisions).
When should I opt for a more secure hash over a faster one?
The answer is simple: opt for a cryptographic hash any time the lookup value is supplied by users and could be maliciously crafted. If the lookup values are coming from some internal source that you trust to be non-malicious and evenly distributed, then you can use a faster hash.