Pandas DataFrame performance
A dict is to a DataFrame as a bicycle is to a car. You can pedal 10 feet on a bicycle faster than you can start a car, get it in gear, etc, etc. But if you need to go a mile, the car wins.
For certain small, targeted purposes, a dict may be faster. And if that is all you need, then use a dict, for sure! But if you need/want the power and luxury of a DataFrame, then a dict is no substitute. It is meaningless to compare speed if the data structure does not first satisfy your needs.
Now for example -- to be more concrete -- a dict is good for accessing columns, but it is not so convenient for accessing rows.
import timeit
setup = '''
import numpy, pandas
df = pandas.DataFrame(numpy.zeros(shape=[10, 1000]))
dictionary = df.to_dict()
'''
# f = ['value = dictionary[5][5]', 'value = df.loc[5, 5]', 'value = df.iloc[5, 5]']
f = ['value = [val[5] for col,val in dictionary.items()]', 'value = df.loc[5]', 'value = df.iloc[5]']
for func in f:
print(func)
print(min(timeit.Timer(func, setup).repeat(3, 100000)))
yields
value = [val[5] for col,val in dictionary.iteritems()]
25.5416321754
value = df.loc[5]
5.68071913719
value = df.iloc[5]
4.56006002426
So the dict of lists is 5 times slower at retrieving rows than df.iloc
. The speed deficit becomes greater as the number of columns grows. (The number of columns is like the number of feet in the bicycle analogy. The longer the distance, the more convenient the car becomes...)
This is just one example of when a dict of lists would be less convenient/slower than a DataFrame.
Another example would be when you have a DatetimeIndex for the rows and wish to select all rows between certain dates. With a DataFrame you can use
df.loc['2000-1-1':'2000-3-31']
There is no easy analogue for that if you were to use a dict of lists. And the Python loops you would need to use to select the right rows would again be terribly slow compared to the DataFrame.
It seems the performance difference is much smaller now (0.21.1 -- I forgot what was the version of Pandas in the original example). Not only the performance gap between dictionary access and .loc
reduced (from about 335 times to 126 times slower), loc
(iloc
) is less than two times slower than at
(iat
) now.
In [1]: import numpy, pandas
...: ...: df = pandas.DataFrame(numpy.zeros(shape=[10, 10]))
...: ...: dictionary = df.to_dict()
...:
In [2]: %timeit value = dictionary[5][5]
85.5 ns ± 0.336 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
In [3]: %timeit value = df.loc[5, 5]
10.8 µs ± 137 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [4]: %timeit value = df.at[5, 5]
6.87 µs ± 64.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [5]: %timeit value = df.iloc[5, 5]
14.9 µs ± 114 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [6]: %timeit value = df.iat[5, 5]
9.89 µs ± 54.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [7]: print(pandas.__version__)
0.21.1
---- Original answer below ----
+1 for using at
or iat
for scalar operations. Example benchmark:
In [1]: import numpy, pandas
...: df = pandas.DataFrame(numpy.zeros(shape=[10, 10]))
...: dictionary = df.to_dict()
In [2]: %timeit value = dictionary[5][5]
The slowest run took 34.06 times longer than the fastest. This could mean that an intermediate result is being cached
1000000 loops, best of 3: 310 ns per loop
In [4]: %timeit value = df.loc[5, 5]
10000 loops, best of 3: 104 µs per loop
In [5]: %timeit value = df.at[5, 5]
The slowest run took 6.59 times longer than the fastest. This could mean that an intermediate result is being cached
100000 loops, best of 3: 9.26 µs per loop
In [6]: %timeit value = df.iloc[5, 5]
10000 loops, best of 3: 98.8 µs per loop
In [7]: %timeit value = df.iat[5, 5]
The slowest run took 6.67 times longer than the fastest. This could mean that an intermediate result is being cached
100000 loops, best of 3: 9.58 µs per loop
It seems using at
(iat
) is about 10 times faster than loc
(iloc
).