Large, persistent DataFrame in pandas
Wes is of course right! I'm just chiming in to provide a little more complete example code. I had the same issue with a 129 Mb file, which was solved by:
import pandas as pd
tp = pd.read_csv('large_dataset.csv', iterator=True, chunksize=1000) # gives TextFileReader, which is iterable with chunks of 1000 rows.
df = pd.concat(tp, ignore_index=True) # df is DataFrame. If errors, do `list(tp)` instead of `tp`
In principle it shouldn't run out of memory, but there are currently memory problems with read_csv
on large files caused by some complex Python internal issues (this is vague but it's been known for a long time: http://github.com/pydata/pandas/issues/407).
At the moment there isn't a perfect solution (here's a tedious one: you could transcribe the file row-by-row into a pre-allocated NumPy array or memory-mapped file--np.mmap
), but it's one I'll be working on in the near future. Another solution is to read the file in smaller pieces (use iterator=True, chunksize=1000
) then concatenate then with pd.concat
. The problem comes in when you pull the entire text file into memory in one big slurp.