When reading a huge HDF5 file with "pandas.read_hdf()", why do I still get a MemoryError even though I read in chunks by specifying chunksize?
So the iterator is built mainly to deal with a where clause. PyTables returns a list of the indices where the clause is True. These are row numbers. In this case, there is no where clause, but we still use the indexer, which in this case is simply np.arange on the list of rows.
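As a rough sketch of what those row-number coordinates look like (the file name coords_demo.h5 and the A > 0 clause are just examples), HDFStore.select_as_coordinates returns them directly for a table stored with data columns:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 2), columns=list('AB'))
df.to_hdf('coords_demo.h5', 'df', mode='w', format='table', data_columns=True)

with pd.HDFStore('coords_demo.h5') as store:
    # These coordinates are plain row numbers for the rows where A > 0,
    # i.e. the kind of indexer described above.
    coords = store.select_as_coordinates('df', 'A > 0')
    print(coords)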
300MM rows takes 2.2GB, which is too much for 32-bit Windows (generally maxes out around 1GB). On 64-bit this would be no problem.
In [1]: np.arange(0,300000000).nbytes/(1024*1024*1024.0)
Out[1]: 2.2351741790771484
So this should be handled by slicing semantics, which would make this operation take only a trivial amount of memory. Issue opened here.
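For comparison, one chunk's worth of row numbers is tiny; a rough illustration (the 100,000-row chunk size here is just an example):

import numpy as np

# Under slicing semantics only one chunk's worth of row numbers would
# need to exist at a time, e.g. for a 100,000-row chunk:
chunksize = 100000
print(np.arange(0, chunksize).nbytes / (1024 * 1024.0))  # ~0.76 MB, vs 2.2 GB above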
So I would suggest the following. Here the start/stop positions are computed directly, which provides iterator semantics without materializing the full indexer.
In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: df = pd.DataFrame(np.random.randn(1000, 2), columns=list('AB'))

In [4]: df.to_hdf('test.h5', 'df', mode='w', format='table', data_columns=True)

In [5]: store = pd.HDFStore('test.h5')

In [6]: nrows = store.get_storer('df').nrows

In [7]: chunksize = 100

In [8]: for i in range(nrows // chunksize + 1):
   ...:     chunk = store.select('df',
   ...:                          start=i * chunksize,
   ...:                          stop=(i + 1) * chunksize)
   ...:     # work on the chunk

In [9]: store.close()
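If you prefer iterator semantics, the same start/stop loop can be wrapped in a small generator; a minimal sketch (the name iter_hdf_chunks and the default chunk size are just illustrative):

import pandas as pd

def iter_hdf_chunks(path, key, chunksize=100000):
    # Yield DataFrame chunks using start/stop slicing, so only one
    # chunk's worth of rows is in memory at a time.
    with pd.HDFStore(path, mode='r') as store:
        nrows = store.get_storer(key).nrows
        for start in range(0, nrows, chunksize):
            yield store.select(key, start=start, stop=start + chunksize)

for chunk in iter_hdf_chunks('test.h5', 'df', chunksize=100):
    pass  # work on the chunk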