Unable to save DataFrame to HDF5 ("object header message is too large")
Although this thread is more than five years old, the problem is still relevant. It's still not possible to save a DataFrame with more than roughly 2000 columns as a single table in an HDFStore. Using format='fixed' isn't an option if one wants to choose which columns to read from the HDFStore later.
Here is a function that splits the DataFrame into smaller ones and stores them as separate tables. Additionally, a pandas.Series is put into the HDFStore containing the information about which table each column belongs to.
import numpy as np
import pandas as pd
from collections import ChainMap


def wideDf_to_hdf(filename, data, columns=None, maxColSize=2000, **kwargs):
    """Write a `pandas.DataFrame` with a large number of columns
    to one HDFStore.

    Parameters
    ----------
    filename : str
        name of the HDFStore
    data : pandas.DataFrame
        data to save in the HDFStore
    columns : list
        a list of columns for storing. If set to `None`, all
        columns are saved.
    maxColSize : int (default=2000)
        this number defines the maximum possible column size of
        a table in the HDFStore.
    """
    store = pd.HDFStore(filename, **kwargs)
    if columns is None:
        columns = data.columns
    colSize = len(columns)
    if colSize > maxColSize:
        # split the columns into chunks of at most maxColSize columns
        numOfSplits = int(np.ceil(colSize / maxColSize))
        colsSplit = [
            columns[i * maxColSize:(i + 1) * maxColSize]
            for i in range(numOfSplits)
        ]
        # map every column name to the table ('data0', 'data1', ...) it is stored in
        _colsTabNum = ChainMap(*[
            dict(zip(cols, ['data{}'.format(num)] * len(cols)))
            for num, cols in enumerate(colsSplit)
        ])
        colsTabNum = pd.Series(dict(_colsTabNum)).sort_index()
        # store each chunk as its own table, plus the column-to-table mapping
        for num, cols in enumerate(colsSplit):
            store.put('data{}'.format(num), data[cols], format='table')
        store.put('colsTabNum', colsTabNum, format='fixed')
    else:
        store.put('data', data[columns], format='table')
    store.close()
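For illustration, a minimal usage sketch; the 5000-column DataFrame and the file name wide.h5 are made up for the example:
import numpy as np
import pandas as pd

# made-up wide DataFrame: 100 rows x 5000 columns
df = pd.DataFrame(np.random.randn(100, 5000),
                  columns=['col_{}'.format(i) for i in range(5000)])

# stored as ceil(5000 / 2000) = 3 tables (data0, data1, data2)
# plus the colsTabNum mapping Series
wideDf_to_hdf('wide.h5', df, maxColSize=2000)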
DataFrames stored in an HDFStore with the function above can be read back with the following function.
def read_hdf_wideDf(filename, columns=None, **kwargs):
    """Read a `pandas.DataFrame` from a HDFStore.

    Parameters
    ----------
    filename : str
        name of the HDFStore
    columns : list
        the columns in this list are loaded. Load all columns,
        if set to `None`.

    Returns
    -------
    data : pandas.DataFrame
        loaded data.
    """
    store = pd.HDFStore(filename)
    data = []
    if '/colsTabNum' in store.keys():
        # the mapping exists, so the data is split over several tables
        colsTabNum = store.select('colsTabNum')
        if columns is not None:
            # invert the mapping: table name -> column names stored in it
            tabNums = pd.Series(
                index=colsTabNum[columns].values,
                data=colsTabNum[columns].index).sort_index()
            for table in tabNums.index.unique():
                data.append(
                    store.select(table, columns=list(tabNums.loc[[table]]), **kwargs))
        else:
            for table in colsTabNum.unique():
                data.append(store.select(table, **kwargs))
        data = pd.concat(data, axis=1).sort_index(axis=1)
    else:
        data = store.select('data', columns=columns)
    store.close()
    return data
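Reading back only a handful of columns from the same hypothetical file then only touches the tables that contain them:
# only the tables containing the requested columns are read
subset = read_hdf_wideDf('wide.h5', columns=['col_0', 'col_4999'])
print(subset.shape)  # (100, 2)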
As of 2014, the HDF documentation has been updated on this point:
If you are using HDF5 1.8.0 or previous releases, there is a limit on the number of fields you can have in a compound datatype. This is due to the 64K limit on object header messages, into which datatypes are encoded. (However, you can create a lot of fields before it will fail. One user was able to create up to 1260 fields in a compound datatype before it failed.)
As for pandas, it can save a DataFrame with an arbitrary number of columns using the format='fixed' option; format='table' still raises the same error described in the question.
I've also tried h5py and got the 'too large header' error as well (even though my HDF5 version was newer than 1.8.0).
HDF5 has a header limit of 64KB for all metadata of the columns. This includes names, types, etc. When you go above roughly 2000 columns, you will run out of space to store all the metadata. This is a fundamental limitation of PyTables. I don't think they will implement a workaround on their side any time soon. You will either have to split the table up or choose another storage format.
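If another storage format is acceptable, Parquet is one option that avoids the header limit while still allowing column-wise reads; a sketch (assumes pyarrow or fastparquet is installed, and the file name is made up):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 5000),
                  columns=['col_{}'.format(i) for i in range(5000)])

# Parquet has no comparable limit on column count and supports
# reading back only a subset of columns
df.to_parquet('wide.parquet')
subset = pd.read_parquet('wide.parquet', columns=['col_0', 'col_10'])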