Curious memory consumption of pandas.unique()
Let's see...
pandas.unique says it's a "hash-table based unique".
It calls this function to acquire the correct hash table implementation for your data, namely htable.Int64HashTable.
The hash table is initialized with size_hint = the length of your value vector. That means kh_resize_DTYPE(table, size_hint) gets called.
Those functions are defined (templated) here in khash.h.
It seems to allocate (size_hint >> 5) * 4 + size_hint * 8 * 2 bytes of memory for the buckets (maybe more, maybe less; I might be off here).
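Taking that estimate verbatim, a quick back-of-the-envelope helper looks like this (note that khash actually rounds the bucket count up to the next power of two on resize, so the real figure can be larger):

```python
def estimated_bucket_bytes(size_hint: int) -> int:
    # Rough estimate from the formula above: 4-byte flag words
    # (one per 32 buckets) plus 8 bytes each for keys and values.
    # khash rounds the bucket count up to a power of two, so this
    # is a lower-bound-ish approximation, not an exact figure.
    return (size_hint >> 5) * 4 + size_hint * 8 * 2

print(estimated_bucket_bytes(1_000_000))  # ~16 MB for a million int64s
```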
Then, HashTable.unique() is called.
It allocates an empty Int64Vector, which seems to quadruple in size whenever it gets filled, starting from 128.
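Assuming that quadruple-from-128 growth policy holds, the capacities the vector passes through while absorbing n elements can be sketched as:

```python
def vector_capacities(n: int) -> list[int]:
    # Sketch of the growth policy described above (an assumption,
    # not the actual Cython implementation): start at capacity 128
    # and multiply by 4 each time the vector fills up.
    caps = [128]
    while caps[-1] < n:
        caps.append(caps[-1] * 4)
    return caps

print(vector_capacities(100_000))
# [128, 512, 2048, 8192, 32768, 131072]
```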
It then iterates over your values, figuring out whether they're in the hash table; if not, they get added to both the hash table and the vector. (This is where the vector may grow; the hash table shouldn't need to grow due to the size hint.)
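In pure Python, that loop amounts to something like this (a sketch, not the actual Cython code; a set stands in for the khash table and a list for the Int64Vector):

```python
def unique_sketch(values):
    # Hash-table based unique: track values we've already seen,
    # and append each value to the output the first time it appears.
    # This preserves first-seen order, as pandas.unique does.
    seen = set()   # stands in for the khash table
    out = []       # stands in for the Int64Vector
    for v in values:
        if v not in seen:
            seen.add(v)
            out.append(v)
    return out

print(unique_sketch([3, 1, 3, 2, 1]))  # [3, 1, 2]
```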
Finally, a NumPy ndarray is made to point at the vector.
So uh, I think you're seeing the vector size quadrupling at certain thresholds, which should be (if my late-night math stands):
>>> [2 ** (2 * i - 1) for i in range(4, 20)]
[
128,
512,
2048,
8192,
32768,
131072,
524288,
2097152,
8388608,
33554432,
134217728,
536870912,
2147483648,
8589934592,
34359738368,
137438953472,
]
Hope this sheds some light on things :)