Python dictionary lookup speed with NumPy datatypes

In my timings, your II - Without NumPy is quite a bit slower than I:

In [11]: timeit [lookupdict[k] for k in np.random.choice(lookupdict.keys(),1000000)]
1 loops, best of 3: 658 ms per loop

In [12]: timeit [lookupdict[k] for k in [np.random.choice(lookupdict.keys()) for _ in range(1000000)]]
1 loops, best of 3: 8.04 s per loop

But if you skip the lookup by making the choice on the values directly, you gain even more time:

In [34]: timeit np.random.choice(lookupdict.values(),1000000)
10 loops, best of 3: 85.3 ms per loop

OK, let's focus on the lookup:

In [26]: arr =np.random.choice(lookupdict.keys(),1000000)

In [27]: arrlist=arr.tolist()

In [28]: timeit res = [lookupdict[k] for k in arr]
1 loops, best of 3: 583 ms per loop

In [29]: timeit res = [lookupdict[k] for k in arrlist]
10 loops, best of 3: 120 ms per loop

In [30]: timeit res = [lookupdict[k] for k in list(arr)]
1 loops, best of 3: 675 ms per loop

In [31]: timeit res = [lookupdict[k] for k in arr.tolist()]
10 loops, best of 3: 156 ms per loop

In [32]: timeit res = [k for k in arr]
1 loops, best of 3: 215 ms per loop

In [33]: timeit res = [k for k in arrlist]
10 loops, best of 3: 51.4 ms per loop

In [42]: timeit arr.tolist()
10 loops, best of 3: 33.6 ms per loop

In [43]: timeit list(arr)
1 loops, best of 3: 264 ms per loop

First observation: iteration over an np.array is slower than iteration over the equivalent list.

Second: list(arr) is slower than arr.tolist(). list() appears to have two problems: the conversion itself is slower, and the resulting items are still np.int32 rather than native int.
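The type difference is easy to check directly (a minimal sketch, written for Python 3; the variable names are illustrative):

```python
import numpy as np

arr = np.arange(5, dtype=np.int32)

# list(arr) iterates over the array and keeps the NumPy scalars
via_list = list(arr)
print(type(via_list[0]))    # <class 'numpy.int32'>

# arr.tolist() converts each element to a native Python int
via_tolist = arr.tolist()
print(type(via_tolist[0]))  # <class 'int'>
```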


Here's a solution using Pandas which gives a five-fold improvement:

import numpy as np
import pandas as pd

# dictionary to use as the lookup dictionary
lookupdict = {
 1: "val1",
 2: "val2",
27: "val3",
35: "val4",
59: "val5" }

# some test data
arr = np.random.choice(lookupdict.keys(), 1000000)

# create a list of words looked up
%timeit res = [ lookupdict[k] for k in arr ]
%timeit res_pd = pd.Series(lookupdict).reindex(arr).values
print all(res == res_pd)

10 loops, best of 3: 192 ms per loop
10 loops, best of 3: 35.3 ms per loop
True

This is an average of 35 ns per element, which is going to be very hard to beat in native Python. If you're not familiar with Pandas, the Series object is like an OrderedDict or an indexed array, and can be constructed from a standard Python dict.

The reindex method provides very fast lookup; I'm not sure exactly how, as I don't really know what goes on under the hood (I'm not a very experienced programmer), but it's probably written in C or Cython. Maybe you could check out the source and come up with an even faster bespoke solution to your problem. Finally, the values attribute simply returns the array underlying the Series.
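For reference, here is a self-contained version of the Pandas approach (a sketch written for Python 3, so the dict keys are materialised with list() before being passed to np.random.choice; the original snippets rely on Python 2's .keys() returning a list):

```python
import numpy as np
import pandas as pd

lookupdict = {1: "val1", 2: "val2", 27: "val3", 35: "val4", 59: "val5"}

# some test data (smaller than the original 1,000,000 for illustration)
arr = np.random.choice(list(lookupdict), 1000)

# build a Series from the dict; its index holds the keys
series = pd.Series(lookupdict)

# reindex performs the vectorised lookup; .values gives the raw array
res_pd = series.reindex(arr).values

# same result as the plain dict comprehension
res = [lookupdict[k] for k in arr]
print(all(res == res_pd))  # True
```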

EDIT: Here's a purely numpy solution which is almost as good as the Pandas:

keys = np.array(lookupdict.keys())
strings = np.array(lookupdict.values())
%timeit res_np = strings[(np.atleast_2d(arr).T == keys).argmax(axis=1)]
10 loops, best of 3: 44.6 ms per loop

print all(res == res_np)
True
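If the key set is large, the broadcasting comparison above builds a len(arr) × len(keys) boolean matrix, which can get expensive. A related trick that avoids this, not from the original answer but a sketch of an alternative, is np.searchsorted on sorted keys:

```python
import numpy as np

lookupdict = {1: "val1", 2: "val2", 27: "val3", 35: "val4", 59: "val5"}
arr = np.random.choice(list(lookupdict), 1000)

# sort the keys once; values are arranged to match the sorted order
keys = np.array(sorted(lookupdict))
strings = np.array([lookupdict[k] for k in keys])

# searchsorted finds each key's position by binary search,
# then one fancy-indexing step pulls out the values
res_np = strings[np.searchsorted(keys, arr)]

res = [lookupdict[k] for k in arr]
print(all(res == res_np))  # True
```

Note this assumes every element of arr is actually present in keys; missing keys would silently map to the nearest slot rather than raise a KeyError.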

As you suspected, it's int32.__hash__ that's at fault here, being about 11× slower than int.__hash__:

%timeit hash(5)
10000000 loops, best of 3: 39.2 ns per loop
%timeit hash(np.int32(5))
1000000 loops, best of 3: 444 ns per loop

(The int32 type is implemented in C. If you're really curious, you can dig into the source code and find out what it's doing there that takes so long.)
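The slowdown is easy to reproduce outside IPython with the timeit module (a sketch; the absolute numbers will vary by machine and NumPy version):

```python
import timeit
import numpy as np

# hash a plain int vs. the same value as a NumPy scalar
t_int = timeit.timeit("hash(x)", setup="x = 5", number=1_000_000)
t_np = timeit.timeit("hash(x)",
                     setup="import numpy as np; x = np.int32(5)",
                     number=1_000_000)

print("int:  ", t_int)
print("int32:", t_np)

# both hash to the same value -- int32 is just slower to hash
print(hash(np.int32(5)) == hash(5))  # True
```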


EDIT:

A second part which slows things down is the implicit == comparison on dict lookup:

a = np.int32(5)
b = np.int32(5)
%timeit a == b  # comparing two int32's
10000000 loops, best of 3: 61.9 ns per loop
%timeit a == 5  # comparing int32 against int -- much slower
100000 loops, best of 3: 2.62 us per loop

This explains why your V is so much faster than I and IV. Of course, sticking with an all-int solution would be faster.
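This also suggests the simplest fix: convert the NumPy scalar to a plain int before the lookup, so both the hash and the comparison stay in pure-int territory (a minimal sketch; the names are illustrative):

```python
import numpy as np

lookupdict = {1: "val1", 2: "val2"}
k = np.int32(2)

# dict lookup with a NumPy scalar works, but pays the slow
# int32 hash and the mixed-type == comparison
slow = lookupdict[k]

# converting to a native int first avoids both penalties
fast = lookupdict[int(k)]

print(slow == fast)  # True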


So as I see it, you have two options:

  1. stick with the pure int type, or convert to int before the dict-lookup
  2. if the biggest code value is not too big, and/or memory is not a problem, you can trade dict lookups for list indexing, which does not require hashing.

E.g.:

lookuplist = [None] * (max(lookupdict.keys()) + 1)
for k,v in lookupdict.items():
    lookuplist[k] = v

res = [ lookuplist[k] for k in arr ] # using list indexing

(EDIT: you might also want to experiment with np.choose here)
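The list-indexing idea carries over to NumPy as well: with small non-negative keys, a dense lookup array lets a single fancy-indexing step replace the Python loop entirely (a sketch of that variant, not from the original answer):

```python
import numpy as np

lookupdict = {1: "val1", 2: "val2", 27: "val3", 35: "val4", 59: "val5"}
arr = np.random.choice(list(lookupdict), 1000)

# dense array indexed directly by key value; gaps stay None
lookup_arr = np.empty(max(lookupdict) + 1, dtype=object)
for k, v in lookupdict.items():
    lookup_arr[k] = v

# one vectorised fancy-indexing step instead of a per-element loop
res_np = lookup_arr[arr]

res = [lookupdict[k] for k in arr]
print(all(res == res_np))  # True
```

The trade-off is memory proportional to the largest key, which is exactly the caveat option 2 mentions.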


This is interesting; I may have found an answer to my question.

Alternative III was to convert the array into a list. It seems that this provides very good results if done the right way. This:

res = [ lookupdict[k] for k in list(arr) ]

clocks 778 ms.

But this:

res = [ lookupdict[k] for k in arr.tolist() ]

clocks 86 ms.

The technical explanation is that arr.tolist() converts the array elements into native int objects, whereas list(arr) creates a list of np.int32 objects.