NumPy grouping using itertools.groupby performance
I get roughly a threefold improvement doing something like this:

def group():
    import numpy as np
    values = np.array(np.random.randint(0, 3298, size=35000000), dtype='u4')
    values.sort()
    # a nonzero diff marks the start of a new group; the leading 1
    # makes the first element start one as well
    dif = np.ones(values.shape, values.dtype)
    dif[1:] = np.diff(values)
    idx = np.where(dif > 0)[0]
    vals = values[idx]                            # unique values
    count = np.diff(np.append(idx, len(values)))  # size of each group
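To make the mechanics concrete, here are the same steps on a tiny hand-made array (the values are arbitrary, chosen only for illustration):

import numpy as np

values = np.array([1, 1, 1, 2, 2, 7], dtype='u4')   # already sorted
dif = np.ones(values.shape, values.dtype)
dif[1:] = np.diff(values)                 # nonzero where the value changes
idx = np.where(dif > 0)[0]                # array([0, 3, 5]): group starts
vals = values[idx]                        # array([1, 2, 7]): unique values
count = np.diff(np.append(idx, len(values)))  # array([3, 2, 1]): group sizes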
More than five years have passed since Paul's answer was accepted. Interestingly, sort() is still the bottleneck in the accepted solution.
Line #      Hits         Time   Per Hit   % Time  Line Contents
===============================================================
     3                                            @profile
     4                                            def group_paul():
     5         1        99040   99040.0      2.4      import numpy as np
     6         1       305651  305651.0      7.4      values = np.array(np.random.randint(0, 2**32, size=35000000), dtype='u4')
     7         1      2928204 2928204.0     71.3      values.sort()
     8         1        78268   78268.0      1.9      diff = np.concatenate(([1], np.diff(values)))
     9         1       215774  215774.0      5.3      idx = np.concatenate((np.where(diff)[0], [len(values)]))
    10         1           95      95.0      0.0      index = np.empty(len(idx) - 1, dtype='u4,u2')
    11         1       386673  386673.0      9.4      index['f0'] = values[idx[:-1]]
    12         1        91492   91492.0      2.2      index['f1'] = np.diff(idx)
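For reference, here is the profiled function lifted out of the listing above as plain runnable code:

import numpy as np

def group_paul():
    values = np.array(np.random.randint(0, 2**32, size=35000000), dtype='u4')
    values.sort()
    diff = np.concatenate(([1], np.diff(values)))
    idx = np.concatenate((np.where(diff)[0], [len(values)]))
    index = np.empty(len(idx) - 1, dtype='u4,u2')
    index['f0'] = values[idx[:-1]]   # unique values
    index['f1'] = np.diff(idx)       # their counts
    return index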
The accepted solution runs for 4.0 s on my machine; with radix sort it drops to 1.7 s. Just by switching to radix sort, I get an overall 2.35x speedup, and the radix sort is more than 4x faster than quicksort in this case.
See How to sort an array of integers faster than quicksort?, which was motivated by this question.
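For completeness, a minimal, purely illustrative sketch of the byte-wise (LSD) radix sort idea in plain NumPy (the function name is mine, and the bucket scatter loops over 256 masks in Python, so the tuned implementations discussed in that question are far faster):

import numpy as np

def radix_sort_u4(a):
    # LSD radix sort on uint32: a stable counting sort on each byte,
    # least significant byte first.
    for shift in (0, 8, 16, 24):
        digits = (a >> shift) & 0xFF
        counts = np.bincount(digits, minlength=256)
        starts = np.concatenate(([0], np.cumsum(counts)[:-1]))
        out = np.empty_like(a)
        for d in range(256):
            # boolean-mask extraction preserves order, so each pass is stable
            out[starts[d]:starts[d] + counts[d]] = a[digits == d]
        a = out
    return a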
For the profiling I used line_profiler and kernprof (the @profile decorator comes from there).
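A typical invocation looks like this (script name hypothetical); kernprof's -l flag collects line-by-line timings and -v prints the report immediately:

kernprof -l -v group_test.py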