When is the CPython set `in` operator O(n)?
Load factor is a red herring. In CPython, sets (and dicts) automatically resize to keep the load factor under 2/3. There's nothing you can do in Python code to stop that.
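A rough way to watch that resizing happen is to observe when the object's memory footprint jumps as elements are added (the exact thresholds and byte counts are CPython implementation details, so treat this as a sketch, not a spec):

import sys

s = set()
last = sys.getsizeof(s)
for i in range(100):
    s.add(i)
    size = sys.getsizeof(s)
    if size != last:
        # The underlying hash table just grew to keep the load factor low.
        print(f"after {len(s)} elements: {size} bytes")
        last = size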
O(N) behavior can occur when a great many elements have exactly the same hash code. Then they map to the same hash bucket, and set lookup degenerates to a slow form of linear search.
The easiest way to contrive such bad elements is to create a class with a horrible hash function. For example (untested):
class C:
    def __init__(self, val):
        self.val = val

    def __eq__(self, other):
        return self.val == other.val

    def __hash__(self):
        return 3
Then hash(C(i)) == 3 regardless of the value of i.
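As a quick, hedged sketch of the consequence (the numbers here are just for illustration): a set built from many such objects still holds them all, because __eq__ keeps them distinct, but every membership test has to crawl through one giant bucket of colliding entries.

s = {C(i) for i in range(1000)}
print(len(s))        # 1000 -- the elements are distinct under __eq__
print(C(500) in s)   # True, but found only after a linear walk of colliding entries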
To do the same with builtin types requires deep knowledge of their CPython implementation details. For example, here's a way to create an arbitrarily large number of distinct ints with the same hash code:
>>> import sys
>>> M = sys.hash_info.modulus
>>> set(hash(1 + i*M) for i in range(10000))
{1}
which shows that the ten thousand distinct ints created all have hash code 1.
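A rough timing sketch built on the same trick (numbers will vary by machine; the point is only to make the degradation visible) compares a set of colliding ints against a set of ordinary ints:

import sys
import timeit

M = sys.hash_info.modulus
colliding = set(1 + i * M for i in range(5000))  # every element hashes to 1
normal = set(range(5000))                        # hashes are well spread

# Looking up an element in the colliding set has to walk a long chain of
# entries with the same hash; the normal set finds its slot almost immediately.
print(timeit.timeit(lambda: (1 + 4999 * M) in colliding, number=1000))
print(timeit.timeit(lambda: 4999 in normal, number=1000))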
You can view the set source here, which can help: https://github.com/python/cpython/blob/723f71abf7ab0a7be394f9f7b2daa9ecdf6fb1eb/Objects/setobject.c#L429-L441
It's difficult to devise a specific example, but luckily the theory is fairly simple :)
The set stores the keys using a hash of the value; as long as that hash is unique enough, you'll end up with the O(1) performance as expected.
If for some weird reason all of your items have different data but the same hash, they collide, and the set will have to check each of them separately.
To illustrate, you can see the set as a dict like this:
import collections

your_set = collections.defaultdict(list)

def add(value):
    your_set[hash(value)].append(value)

def contains(value):
    # This is where your O(n) can occur: if all values share the same hash(),
    # they all land in one bucket and must be compared one by one.
    values = your_set.get(hash(value), [])
    for v in values:
        if v == value:
            return True
    return False
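For example, a hypothetical usage of the sketch above (the Collider class is invented here purely to force identical hashes):

class Collider:
    def __init__(self, val):
        self.val = val
    def __eq__(self, other):
        return self.val == other.val
    def __hash__(self):
        return 42  # every instance hashes identically

for i in range(5):
    add(Collider(i))

print(len(your_set))           # 1 -- a single bucket holds all five items
print(contains(Collider(3)))   # True, but only after scanning that one bucket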