Why is numpy/pandas parsing of a csv file with long lines so slow?
The array_csv and numpy_csv times are quite similar. If you look at the loadtxt code you'll see that it works in much the same way. With array_csv you construct an array for each line and use it, while numpy_csv collects the parsed (and converted) lines into one list, which is converted to an array at the end.
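As a rough sketch, assuming the question's two functions follow the description above (the names and per-row work here are guesses, not the asker's actual code):
import numpy as np

def array_csv(f):
    # build (and use) a separate array for every line
    for line in f:
        row = np.array([float(v) for v in line.split(',')])
        row.sum()          # ...whatever per-row work is needed

def numpy_csv(f):
    # collect converted rows in a plain list, convert once at the end
    rows = [[float(v) for v in line.split(',')] for line in f]
    return np.array(rows)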
For each row, loadtxt does:
vals = split_line(line)
...
# Convert each value according to its column and store
items = [conv(val) for (conv, val) in zip(converters, vals)]
# Then pack it according to the dtype's nesting
items = pack_items(items, packing)
X.append(items)
with a final
X = np.array(X, dtype)
That [conv(val) for ...] line is just a generalization of your [float(val) for val in ...].
If a plain list does the job, don't convert it to an array. That just adds unnecessary overhead.
Functions like loadtxt are most valuable when the csv columns contain a mix of data types. They streamline the work of creating structured arrays from that data. For pure numeric data such as yours they don't add much.
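For example, loadtxt pays off when each column needs its own converter and the result should be a structured array (this small snippet is only an illustration, not data from the question):
import numpy as np
from io import StringIO

# a small mixed-type csv: an integer id followed by two float measurements
data = StringIO(u"1,2.5,0.1\n2,3.7,0.2\n")

# one dtype field per column; loadtxt converts each column accordingly
# and packs the rows into a structured array
arr = np.loadtxt(data, delimiter=',',
                 dtype=[('id', 'i8'), ('x', 'f8'), ('y', 'f8')])
print(arr['x'])    # [ 2.5  3.7]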
I can't speak for pandas, except that it has yet another layer on top of numpy, and does a lot of its own hard coding.
If you give pandas the dtypes as a dictionary (pd.read_csv(..., dtype={'x': np.float64})), it will be much faster, because otherwise pandas tries to infer the data type of every column.
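For example (the file name and column names here are placeholders, not from the question):
import numpy as np
import pandas as pd

# mapping every column to its dtype up front lets pandas skip type inference
dtypes = {'x': np.float64, 'y': np.float64}
df = pd.read_csv('data.csv', dtype=dtypes)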
Does your CSV file contain column headers? If not, then explicitly passing header=None to pandas.read_csv can give a slight performance improvement for the Python parsing engine (but not for the C engine):
In [1]: np.savetxt('test.csv', np.random.randn(1000, 20000), delimiter=',')
In [2]: %timeit pd.read_csv('test.csv', delimiter=',', engine='python')
1 loops, best of 3: 9.19 s per loop
In [3]: %timeit pd.read_csv('test.csv', delimiter=',', engine='c')
1 loops, best of 3: 6.47 s per loop
In [4]: %timeit pd.read_csv('test.csv', delimiter=',', engine='python', header=None)
1 loops, best of 3: 6.26 s per loop
In [5]: %timeit pd.read_csv('test.csv', delimiter=',', engine='c', header=None)
1 loops, best of 3: 6.46 s per loop
Update
If there are no missing or invalid values then you can do a little better by passing na_filter=False (only valid for the C engine):
In [6]: %timeit pd.read_csv('test.csv', sep=',', engine='c', header=None)
1 loops, best of 3: 6.42 s per loop
In [7]: %timeit pd.read_csv('test.csv', sep=',', engine='c', header=None, na_filter=False)
1 loops, best of 3: 4.72 s per loop
There may also be small gains to be had by specifying the dtype explicitly:
In [8]: %timeit pd.read_csv('test.csv', sep=',', engine='c', header=None, na_filter=False, dtype=np.float64)
1 loops, best of 3: 4.36 s per loop
Update 2
Following up on @morningsun's comment, setting low_memory=False squeezes out a bit more speed:
In [9]: %timeit pd.read_csv('test.csv', sep=',', engine='c', header=None, na_filter=False, dtype=np.float64, low_memory=True)
1 loops, best of 3: 4.3 s per loop
In [10]: %timeit pd.read_csv('test.csv', sep=',', engine='c', header=None, na_filter=False, dtype=np.float64, low_memory=False)
1 loops, best of 3: 3.27 s per loop
For what it's worth, these benchmarks were all done using the current dev version of pandas (0.16.0-19-g8d2818e).
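Putting the options from the timings above together, the fast path for a purely numeric, header-less file looks something like this (the file name is a placeholder; .values just pulls out the underlying numpy array if that's what you ultimately need):
import numpy as np
import pandas as pd

# all inference turned off: no header, no NA scanning, one known dtype
df = pd.read_csv('test.csv', sep=',', engine='c', header=None,
                 na_filter=False, dtype=np.float64, low_memory=False)
arr = df.values    # plain 2-D float64 ndarray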
In the pure python case, you're iterating over the rows and printing as you go. In the pandas case, you're importing the whole thing into a DataFrame, and then iterating over the rows. But pandas' strength isn't in iterating over the rows - it's in operations that take place over the whole DataFrame. Compare the speed of:
def pandas_csv():
    with open("../data/temp_fixed_l_no_initial", "r") as f:
        print(csv_loader(f).sum(axis=1))
This is still somewhat slower than the pure python approach, which you're welcome to use if this is the extent of your use case. But as @ali_m's comment points out, if you want to do more than print the sum of the rows, or if you want to transform the data in any way, you will probably find pandas or numpy to be more efficient both in processing time and programming time.
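For instance, a whole-DataFrame transformation such as centring and scaling every column stays vectorized and never drops into a Python-level loop (the particular transformation here is only an illustration, and header=None assumes the file has no header row, as discussed above):
import pandas as pd

df = pd.read_csv("../data/temp_fixed_l_no_initial", header=None)

# column-wise over the whole DataFrame, executed in compiled code
normalised = (df - df.mean()) / df.std()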