Is there a performance difference between Numpy and Pandas?
I think it's more about using the two strategically and shifting data around (from numpy to pandas or vice versa) based on the performance you see. As a recent example, I was trying to concatenate 4 small pickle files with 10k rows each (data.shape -> (10000, 4)) using numpy.
The code was something like:
import glob
import joblib
import numpy as np

n_concat = np.empty((0, 4))
for file_path in glob.glob('data/0*', recursive=False):
    n_data = joblib.load(file_path)
    n_concat = np.vstack((n_concat, n_data))  # stack each newly loaded file onto the result
joblib.dump(n_concat, 'data/save_file.pkl', compress=True)
This crashed my laptop (8 GB RAM, i5), which was surprising since the volume wasn't really that huge: the 4 compressed pickle files were roughly 5 MB each.
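As an aside, np.vstack inside a loop reallocates and copies the whole accumulated array on every iteration, so a common pattern (a minimal sketch below, assuming the same hypothetical 'data/0*' files) is to collect the pieces in a list and stack once at the end:

import glob
import joblib
import numpy as np

# Collect chunks in a Python list, then stack a single time.
chunks = [joblib.load(file_path) for file_path in glob.glob('data/0*', recursive=False)]
n_concat = np.vstack(chunks)  # one copy instead of one per iteration
joblib.dump(n_concat, 'data/save_file.pkl', compress=True)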
The same thing worked great with pandas:
import glob
import joblib
import pandas as pd

for file_path in glob.glob('data/0*', recursive=False):
    n_data = joblib.load(file_path)
    try:
        df = pd.concat([df, pd.DataFrame(n_data, columns=[...])])
    except NameError:  # first iteration: df does not exist yet
        df = pd.DataFrame(n_data, columns=[...])
joblib.dump(df, 'data/save_file.pkl', compress=True)
On the other hand, when I was implementing gradient descent by iterating over a pandas data frame, it was horribly slow, while using numpy for the job was much quicker.
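To give an idea of what I mean, here is a rough sketch (not my original code; the data and names are made up) of a single least-squares gradient step done by iterating over DataFrame rows versus the same step vectorized in numpy:

import numpy as np
import pandas as pd

# Toy data; sizes and column names are purely illustrative.
rng = np.random.default_rng(0)
df = pd.DataFrame({'x': rng.normal(size=100_000), 'y': rng.normal(size=100_000)})
lr = 0.01

# Slow: one gradient step for w, b by iterating over DataFrame rows.
w = b = 0.0
grad_w = grad_b = 0.0
for _, row in df.iterrows():
    err = (w * row['x'] + b) - row['y']
    grad_w += err * row['x']
    grad_b += err
w -= lr * grad_w / len(df)
b -= lr * grad_b / len(df)

# Fast: the same step on plain numpy arrays, no per-row Python overhead.
x, y = df['x'].to_numpy(), df['y'].to_numpy()
w = b = 0.0
err = (w * x + b) - y
w -= lr * np.dot(err, x) / len(x)
b -= lr * err.mean()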
In general, I've seen that pandas usually works better for moving around/munging moderately large chunks of data and doing common column operations, while numpy works best for vectorized and recursive (maybe more math-intensive) work over smaller sets of data.
Moving data between the two is hassle-free, so I guess using both strategically is the way to go.
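For reference, the round trip between the two is a one-liner each way (the frame below is just a placeholder):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})   # placeholder frame
arr = df.to_numpy()                          # DataFrame -> ndarray (df.values also works)
df2 = pd.DataFrame(arr, columns=df.columns)  # ndarray -> DataFrame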
In my experiments on large numeric data, Pandas is consistently 20 TIMES SLOWER than Numpy. This is a huge difference, given that only simple arithmetic operations were performed: slicing of a column, mean(), searchsorted() (see below). Initially, I thought Pandas was based on Numpy, or at least that its implementation was C-optimized just like Numpy's. These assumptions turned out to be false, though, given the huge performance gap.
In the examples below, data is a pandas frame with 8M rows and 3 columns (int32, float32, float32), without NaN values; column #0 (time) is sorted. data_np was created as data.values.astype('float32'). Results below are on Python 3.8, Ubuntu.
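If you want to reproduce the setup, something along these lines should produce an equivalent frame (the column names and value ranges are my guesses, not the original data):

import numpy as np
import pandas as pd

n = 8_000_000
data = pd.DataFrame({
    'time': np.arange(1_492_000_000, 1_492_000_000 + n, dtype='int32'),  # sorted epoch-like ints (guessed range)
    'x': np.random.rand(n).astype('float32'),
    'y': np.random.rand(n).astype('float32'),
})
data_np = data.values.astype('float32')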
A. Column slices and mean():
# Pandas
%%timeit
x = data.x
for k in range(100): x[100000:100001+k*100].mean()
15.8 ms ± 101 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Numpy
%%timeit
for k in range(100): data_np[100000:100001+k*100,1].mean()
874 µs ± 4.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Pandas is 18 times slower than Numpy (15.8ms vs 0.874 ms).
B. Search in a sorted column:
# Pandas
%timeit data.time.searchsorted(1492474643)
20.4 µs ± 920 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# Numpy
%timeit data_np[:, 0].searchsorted(1492474643)
1.03 µs ± 3.55 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Pandas is 20 times slower than Numpy (20.4µs vs 1.03µs).
EDIT: I implemented a namedarray class that bridges the gap between Pandas and Numpy in that it is based on Numpy's ndarray class and hence performs better than Pandas (typically ~7x faster), and it is fully compatible with Numpy's API and all its operators; at the same time it keeps column names similar to Pandas' DataFrame, so that manipulating individual columns is easier. This is a prototype implementation. Unlike Pandas, namedarray does not allow different data types for columns. The code can be found here: https://github.com/mwojnars/nifty/blob/master/math.py (search "namedarray").
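The general idea, very roughly, is to subclass np.ndarray and let columns be addressed by name. A bare-bones illustration of that idea (not the actual namedarray code from the link above) could look like this:

import numpy as np

class NamedArray(np.ndarray):
    # Toy sketch: a 2-D float ndarray whose columns can be read by name.
    # Only an illustration of the idea, not the linked implementation.
    def __new__(cls, data, columns):
        obj = np.asarray(data, dtype='float32').view(cls)
        obj.columns = list(columns)
        return obj

    def __array_finalize__(self, obj):
        self.columns = getattr(obj, 'columns', None)

    def col(self, name):
        # Return the column as a plain ndarray view, with no Pandas overhead.
        return np.asarray(self)[:, self.columns.index(name)]

a = NamedArray(np.random.rand(1000, 3), columns=['time', 'x', 'y'])
print(a.col('x').mean())   # column access by name
print((a * 2).sum())       # all ndarray operators still work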
There can be a significant performance difference, of an order of magnitude for multiplications and multiple orders of magnitude for indexing a few random values.
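A quick way to see both effects is a micro-benchmark along these lines (the array size and loop counts are arbitrary; the exact ratios depend heavily on data size, since much of Pandas' cost is per-call overhead):

import timeit
import numpy as np
import pandas as pd

arr = np.random.rand(1_000)   # small array, so per-call overhead dominates
ser = pd.Series(arr)

# Element-wise multiplication
print('np *2 :', timeit.timeit(lambda: arr * 2, number=10_000))
print('pd *2 :', timeit.timeit(lambda: ser * 2, number=10_000))

# Indexing a few random positions
idx = np.random.randint(0, 1_000, size=5)
print('np idx:', timeit.timeit(lambda: arr[idx], number=10_000))
print('pd idx:', timeit.timeit(lambda: ser.iloc[idx], number=10_000))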
I was actually wondering about the same thing and came across this interesting comparison: http://penandpants.com/2014/09/05/performance-of-pandas-series-vs-numpy-arrays/