Why is numpy.random.choice so slow?

This solution, using a running cumulative sum of the probabilities, is about 25x faster:

import time
from collections import Counter

import numpy as np

def choice(options, probs):
    # Walk the running cumulative probability and return the first option
    # whose bucket the uniform draw falls into.
    x = np.random.rand()
    cum = 0
    for i, p in enumerate(probs):
        cum += p
        if x < cum:
            break
    return options[i]
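
If many draws are needed at once, the same cumulative-sum idea can be vectorized: build the cumulative distribution once and let np.searchsorted find the bucket for every draw in a single call. A rough sketch of that variant (not part of the benchmark below):

import numpy as np

def choice_vectorized(options, probs, n):
    # Cumulative distribution, e.g. [0.2, 0.8, 0.95, 1.0] for the probs below.
    cum = np.cumsum(probs)
    x = np.random.rand(n)
    # For each uniform draw, find the first bucket whose cumulative value exceeds it.
    idx = np.searchsorted(cum, x, side='right')
    idx = np.minimum(idx, len(probs) - 1)  # guard against float rounding in cum[-1]
    return np.asarray(options)[idx]

A single call like choice_vectorized(options, probs, runs) then produces all the samples that the loop below generates one at a time.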


options = ['a', 'b', 'c', 'd']
probs = [0.2, 0.6, 0.15, 0.05]
runs = 100000


# Time the hand-rolled cumulative-sum sampler, one draw per loop iteration.
now = time.time()
temp = []
for i in range(runs):
    op = choice(options, probs)
    temp.append(op)
temp = Counter(temp)
for op, x in temp.items():
    print(op, x / runs)
print(time.time() - now)

print("")
# Time np.random.choice called once per draw, for comparison.
now = time.time()
temp = []
for i in range(runs):
    op = np.random.choice(options, p=probs)
    temp.append(op)
temp = Counter(temp)
for op, x in temp.items():
    print(op, x / runs)
print(time.time() - now)

Running it I get:

b 0.59891
a 0.20121
c 0.15007
d 0.04981
0.16232800483703613

b 0.5996
a 0.20138
c 0.14856
d 0.05046
3.8451428413391113

I suspect the generality of np.random.choice is what slows it down, and its per-call overhead hurts small samples much more than large ones.

A crude vectorization of the if/elif version is:

import numpy as np

def foo(n):
    # Draw n values from [-1, 0, 1] with probabilities [0.25, 0.5, 0.25]
    # by thresholding uniform random numbers.
    x = np.random.rand(n)
    var = np.zeros(n)
    var[x < .25] = -1
    var[x > .75] = 1
    return var

Running it in IPython I get:

timeit np.random.choice([-1,0,1],size=1000,p=[.25,.5,.25])
1000 loops, best of 3: 293 us per loop

timeit foo(1000)
10000 loops, best of 3: 83.4 us per loop

timeit np.random.choice([-1,0,1],size=100000,p=[.25,.5,.25])
100 loops, best of 3: 11 ms per loop

timeit foo(100000)
100 loops, best of 3: 8.12 ms per loop

So at size 1000, choice is 3-4x slower, but with larger vectors the difference starts to disappear.
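
Newer NumPy versions also expose the same weighted draw through the Generator API, which may be worth timing alongside the calls above on your own setup (a minimal sketch, not part of the original timings):

import numpy as np

rng = np.random.default_rng()

# Same draw as np.random.choice above, via the newer Generator interface.
samples = rng.choice([-1, 0, 1], size=1000, p=[0.25, 0.5, 0.25])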


You're using it wrong. Vectorize the operation, or numpy will offer no benefit:

var = numpy.random.choice([-1, 0, 1], size=1000, p=[0.25, 0.5, 0.25])

Timing data:

>>> timeit.timeit('''numpy.random.choice([-1, 0, 1],
...                                      size=1000,
...                                      p=[0.25, 0.5, 0.25])''',
...               'import numpy', number=10000)
2.380380242513752

>>> timeit.timeit('''
... var = []
... for i in xrange(1000):
...     tmp = rand.rand()
...     if tmp < 0.25:
...         var.append(1)
...     elif tmp < 0.5:
...         var.append(-1)
...     else:
...         var.append(0)''',
... setup='import numpy.random as rand', number=10000)
5.673041396894519
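
To sanity-check the vectorized draw the same way the loop versions were checked with Counter, np.unique can tally the result in one call (a small usage sketch, not part of the original answer):

import numpy as np

var = np.random.choice([-1, 0, 1], size=1000, p=[0.25, 0.5, 0.25])

# Tally each drawn value and print its empirical frequency.
values, counts = np.unique(var, return_counts=True)
for v, c in zip(values, counts):
    print(v, c / len(var))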

It took me a long time to figure out that my data generator was very slow because of the random key sampling done via np.random.choice.

If a non-uniform distribution is NOT needed, here is the workable solution I found.

Replace

def get_random_key(a_huge_key_list):
    return np.random.choice(a_huge_key_list)

with

def get_random_key(a_huge_key_list):
    # Pick a uniform random index instead of handing the whole list to
    # np.random.choice on every call.
    L = len(a_huge_key_list)
    i = np.random.randint(0, L)
    return a_huge_key_list[i]

which gives a speedup of about 60x.
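
If the keys are already in a plain Python list, the standard library's random module is another option for a single uniform pick, since it avoids NumPy call overhead entirely (a sketch, not benchmarked against the version above):

import random

def get_random_key(a_huge_key_list):
    # One uniform pick straight from the Python list, with no NumPy involvement.
    return random.choice(a_huge_key_list)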