How is set() implemented?
According to this thread:
Indeed, CPython's sets are implemented as something like dictionaries with dummy values (the keys being the members of the set), with some optimization(s) that exploit this lack of values
So basically a set
uses a hashtable as its underlying data structure. This explains the O(1)
membership checking, since looking up an item in a hashtable is an O(1)
operation, on average.
If you are so inclined you can even browse the CPython source code for set
which, according to Achim Domma, was originally mostly a cut-and-paste from the dict
implementation.
Note: Nowadays, set
and dict
's implementations have diverged significantly, so the precise behaviors (e.g. arbitrary order vs. insertion order) and performance in various use cases differs; they're still implemented in terms of hashtables, so average case lookup and insertion remains O(1)
, but set
is no longer just "dict
, but with dummy/omitted keys".
When people say sets have O(1) membership-checking, they are talking about the average case. In the worst case (when all hashed values collide) membership-checking is O(n). See the Python wiki on time complexity.
The Wikipedia article says the best case time complexity for a hash table that does not resize is O(1 + k/n)
. This result does not directly apply to Python sets since Python sets use a hash table that resizes.
A little further on the Wikipedia article says that for the average case, and assuming a simple uniform hashing function, the time complexity is O(1/(1-k/n))
, where k/n
can be bounded by a constant c<1
.
Big-O refers only to asymptotic behavior as n → ∞. Since k/n can be bounded by a constant, c<1, independent of n,
O(1/(1-k/n))
is no bigger than O(1/(1-c))
which is equivalent to O(constant)
= O(1)
.
So assuming uniform simple hashing, on average, membership-checking for Python sets is O(1)
.