Curious memory consumption of pandas.unique()

Let's see...

The pandas.unique documentation says it's a "hash-table based unique".

It calls an internal helper function to pick the correct hash table implementation for your data, namely htable.Int64HashTable.
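
Roughly speaking, the dtype of what you pass in decides which table class gets used; a quick illustration from the outside (not the dispatch code itself):

import numpy as np
import pandas as pd

# The input dtype determines the hash table used internally
# (Int64HashTable for int64, Float64HashTable for float64,
# PyObjectHashTable for object arrays, and so on).
print(pd.unique(np.array([3, 1, 3, 2], dtype=np.int64)))       # [3 1 2]
print(pd.unique(np.array([3.0, 1.0, 3.0], dtype=np.float64)))  # [3. 1.]
print(pd.unique(np.array(["b", "a", "b"], dtype=object)))      # ['b' 'a']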

The hash table is initialized with size_hint = the length of your value vector. That means kh_resize_DTYPE(table, size_hint) gets called.

Those functions are defined (templated) in khash.h.

It seems to allocate (size_hint >> 5) * 4 + (size_hint) * 8 * 2 bytes of memory for the buckets (maybe more, maybe less, I might be off here).
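
To put rough numbers on that, here is the same back-of-the-envelope formula as a function (it is only as good as the guess above, and khash may round the bucket count up, so treat the output as a ballpark):

def rough_table_bytes(size_hint):
    # Ballpark from the formula above: flag words plus 8-byte keys
    # and 8-byte values per bucket. The real figure can differ.
    return (size_hint >> 5) * 4 + size_hint * 8 * 2

for n in (10**6, 10**7, 10**8):
    print(f"{n:>11,} values -> ~{rough_table_bytes(n) / 2**20:.0f} MiB")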

Then, HashTable.unique() is called.
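
If you want to poke at this yourself, you can drive the table directly through pandas' private _libs.hashtable module (a sketch; since this is a private API, the constructor and method signatures may differ between pandas versions):

import numpy as np
from pandas._libs import hashtable as htable  # private API, may change across versions

values = np.tile(np.arange(10, dtype=np.int64), 3)  # 30 values, 10 distinct

# Pre-size the table to len(values), as pandas.unique does, then ask it
# for the uniques in one pass.
table = htable.Int64HashTable(len(values))  # size_hint
uniques = table.unique(values)
print(uniques)  # distinct values, in order of first appearance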

HashTable.unique() allocates an empty Int64Vector, which seems to quadruple its size whenever it fills up, starting from 128.

It then iterates over your values, figuring out whether they're in the hash table; if not, they get added to both the hash table and the vector. (This is where the vector may grow; the hash table shouldn't need to grow due to the size hint.)
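
In pure Python, that loop is morally this (just a sketch of the behaviour; the real thing is Cython, with the khash table in place of the set and an Int64Vector in place of the list):

import numpy as np

def unique_sketch(values):
    # Model of the loop described above, not pandas' actual code.
    seen = set()   # stands in for the khash table
    out = []       # stands in for the growable Int64Vector
    for v in values:
        if v not in seen:
            seen.add(v)
            out.append(v)
    return np.asarray(out, dtype=values.dtype)

print(unique_sketch(np.array([3, 1, 3, 2, 1], dtype=np.int64)))  # [3 1 2]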

Finally, a NumPy ndarray is made to point at the vector.

So uh, I think you're seeing the vector size quadrupling at certain thresholds, which (if my late-night math stands) should be:

>>> [2 ** (2 * i - 1) for i in range(4, 20)]
[
    128,
    512,
    2048,
    8192,
    32768,
    131072,
    524288,
    2097152,
    8388608,
    33554432,
    134217728,
    536870912,
    2147483648,
    8589934592,
    34359738368,
    137438953472,
    ...,
]
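
Put differently, a toy model of that growth policy (just describing the behaviour above, not Int64Vector's actual code) makes the jumps easy to see:

def vector_capacity(n_uniques):
    # Start at 128 slots and quadruple every time the buffer fills up.
    cap = 128
    while cap < n_uniques:
        cap *= 4
    return cap

# Crossing a threshold reserves 4x the previous space (8 bytes per int64):
for n in (128, 129, 2048, 2049):
    print(f"{n} uniques -> {vector_capacity(n) * 8:,} bytes reserved")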

Hope this sheds some light on things :)