How does Dictionary use the Equatable protocol in Swift?
While the question has already been answered, those answers raised some additional questions in the comments. @Suragch asked if I would include my comments in a new answer to help others who might also be confused by the relationship between Equatable
and Hashable
. I have to start by saying that I only have a rudimentary understanding of the underlying mechanics, but I'll do my best to explain what I know.
Equatable
is a fairly simple concept, and we need look no further than the Swift docs for a succinct definition of this protocol:
Equatable: A type that can be compared for value equality.
If a type is equatable, we can compare two instances with ==
. Easy.
Hashable
is a different story. I actually laughed out loud when I first read the definition of this protocol in the Swift docs:
Hashable: A type that can be hashed into a Hasher to produce an integer hash value.
If that hasn't cleared it up for you, you're not alone! And anyway, if ==
is used to determine if an instance is truly unique (which it needs to be in a set or when used as a key in a dictionary), why do we need Hashable
at all? (This is the question @Suragch asked in the comments.)
That question goes to the fundamental nature of hashed collections (or hash tables) like sets and dictionaries. Consider why you might choose a dictionary over an array in the first place. You choose a dictionary when you need to find or refer to an instance by something other than a known index, right? Unlike an array, a dictionary's elements aren't numbered sequentially, which makes finding stuff harder. If all we had was ==
, we'd have to iterate over our dictionary's elements, one by one, and that would get slower and slower as the size of the dictionary increased.
This is where the magic of hash functions come in. The hash function (or hasher) takes a unique key as an argument and returns the element's address. How can you be sure it will return the correct address? Because it's the same function that was used to set that address in the first place! When the dictionary was created, it took each key (or rather, the uniquely identifying properties of each key), mashed them up according to some secret formula, and spat out a number (hash value) for each one, and from those numbers came the new indexes for each element. Later, when you want to look up one of those elements, the hasher gets the same arguments, so it returns the same value. And because all you're doing is calling a function, there's no iteration involved and the results are fast, no matter how large the collection gets.
There's a catch though. No hasher is perfect. Even if you feed it unique arguments, occasionally it may return the same hash value for two completely different elements—a hash collision. When that happens, it needs to check if the two elements really are the same, and it does that, of course, by calling ==
.
But in your example, you manipulated hashValue
directly (which was the kind of thing people did before hash(into:)
came along!) and it still called ==
! I mean, in theory, it shouldn't need to do this, since there aren't any collisions. But the answer is there in the comment quoted by robinkunde:
In hashed collections, collisions can occur whenever the number of buckets is smaller than the keyspace
While generally, we don't need to worry about the implementation details of Swift's built-in hasher function, this particular detail is important… Behind the scenes, the hasher needs another argument, and that is the size of the collection. If the size changes (as it does repeatedly when you're iterating over a range and inserting new elements into the collection), the hasher can either try to shove in more elements than there are indexed slots (or buckets) and you get a collision, or it needs to rehash everything from scratch with enough memory to give every element a unique index. What appears to happen is a combination of both, as the comment quoted by matt says:
We try to insert the element before evaluating if the Dictionary can fit that element. This is because the element might already be in the Dictionary, in which case we don't need any more capacity.
So that's my attempt at a simpler explanation of hashed collections, the relationship between the hash function and your ==
method, and the reasons for the unexpected behaviour. But this all raises one more question for me… Why do we need to manually declare Hashable
? Couldn't Apple have baked in some kind of algorithm to synthesize Hashable conformance for all Equatable types? I mean, the hash(into:)
documentation says:
The components used for hashing must be the same as the components compared in your type’s == operator implementation.
If the components need to be the same, couldn't Swift infer our intention from our implementation of Equatable
alone? I'm not sure why it couldn't offer such a convenience (similar to the way it offers default initialisers), for those who don't want more control over the details. Perhaps one day Swift will offer this? For now though, they kept them as separate concerns, with Hashable
inheriting from Equatable
.
I'm copying my answer from bugs.swift.org here. It talks about Sets but the details apply to Dictionaries in the same way.
In hashed collections, collisions can occur whenever the number of buckets is smaller than the keyspace. When you're creating a new Set without specifying a minimum capacity, the set might have only one bucket, so when you're inserting the second element, a collision occurs. The insert method will then decide if the storage should be grown, using something called a load factor. If the storage was grown, the existing elements have to be migrated over to the new storage buffer. That's when you're seeing all those extra calls to hashValue when inserting 4.
The reason you're still seeing more calls to == than you would expect if the number of buckets is equal or higher to the number of elements has to do with an implementation detail of the bucket index calculation. The bits of the hashValue are mixed or "shuffled" before the modulo operation. This is to cut down on excessive collisions for types with bad hash algorithms.
Well, there's your answer:
https://bugs.swift.org/browse/SR-3330?focusedCommentId=19980&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-19980
What's actually happening:
- We hash a value only once on insertion.
- We don't use hashes for comparison of elements, only ==. Using hashes for comparison is only reasonable if you store the hashes, but that means more memory usage for every Dictionary. A compromise that needs evaluation.
- We try to insert the element before evaluating if the Dictionary can fit that element. This is because the element might already be in the Dictionary, in which case we don't need any more capacity.
- When we resize the Dictionary, we have to rehash everything, because we didn't store the hashes.
So what you're seeing is:
- one hash of the search key
- some =='s (searching for a space)
- hashes of every element in the collection (resize)
- one hash of the search key (actually totally wasteful, but not a big deal considering it only happens after an Oð reallocation)
- some =='s (searching for a space in the new buffer)
We all had it totally wrong. They don't use the hashes at all — only ==
— to decide whether this is a distinct key. And then there is a second round of calls in the situation where the collection is grown.