When reading a huge HDF5 file with "pandas.read_hdf()", why do I still get a MemoryError even though I read in chunks by specifying chunksize?
So the iterator is built mainly to deal with a where clause. PyTables returns a list of the indices where the clause is True. These are row numbers. In this case, there is no where clause, but we still use the indexer, which in this case is simply np.arange on the list of rows.
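As a rough sketch of what those row-number coordinates look like (the file name coords_demo.h5 and the A > 0 clause are just examples), HDFStore.select_as_coordinates returns them directly for a table stored with data columns:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 2), columns=list('AB'))
df.to_hdf('coords_demo.h5', 'df', mode='w', format='table', data_columns=True)

with pd.HDFStore('coords_demo.h5') as store:
    # These coordinates are plain row numbers for the rows where A > 0,
    # i.e. the kind of indexer described above.
    coords = store.select_as_coordinates('df', 'A > 0')
    print(coords)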
300MM rows takes 2.2GB, which is too much for 32-bit Windows (generally maxes out around 1GB). On 64-bit this would be no problem.
In [1]: np.arange(0,300000000).nbytes/(1024*1024*1024.0)
Out[1]: 2.2351741790771484
So this should be handled by slicing semantics, which would make this operation take only a trivial amount of memory. Issue opened here.
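For comparison, one chunk's worth of row numbers is tiny; a rough illustration (the 100,000-row chunk size here is just an example):

import numpy as np

# Under slicing semantics only one chunk's worth of row numbers would
# need to exist at a time, e.g. for a 100,000-row chunk:
chunksize = 100000
print(np.arange(0, chunksize).nbytes / (1024 * 1024.0))  # ~0.76 MB, vs 2.2 GB above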
So I would suggest the following. Here the start/stop positions are computed directly, which provides iterator semantics without materializing the full indexer.
In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: df = pd.DataFrame(np.random.randn(1000, 2), columns=list('AB'))

In [4]: df.to_hdf('test.h5', 'df', mode='w', format='table', data_columns=True)

In [5]: store = pd.HDFStore('test.h5')

In [6]: nrows = store.get_storer('df').nrows

In [7]: chunksize = 100

In [8]: for i in range(nrows // chunksize + 1):
   ...:     chunk = store.select('df',
   ...:                          start=i * chunksize,
   ...:                          stop=(i + 1) * chunksize)
   ...:     # work on the chunk

In [9]: store.close()
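If you prefer iterator semantics, the same start/stop loop can be wrapped in a small generator; a minimal sketch (the name iter_hdf_chunks and the default chunk size are just illustrative):

import pandas as pd

def iter_hdf_chunks(path, key, chunksize=100000):
    # Yield DataFrame chunks using start/stop slicing, so only one
    # chunk's worth of rows is in memory at a time.
    with pd.HDFStore(path, mode='r') as store:
        nrows = store.get_storer(key).nrows
        for start in range(0, nrows, chunksize):
            yield store.select(key, start=start, stop=start + chunksize)

for chunk in iter_hdf_chunks('test.h5', 'df', chunksize=100):
    pass  # work on the chunk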