Saving an array or DataFrame to a file together with other information
There are many options. I will discuss only HDF5, because I have experience using this format.
Advantages: Portable (can be read outside of Python), native compression, out-of-memory capabilities, metadata support.
Disadvantages: Reliance on a single low-level C API, risk of data corruption because everything lives in a single file, deleting data does not reduce file size automatically.
In my experience, for performance and portability, avoid pyTables / HDFStore to store numeric data. You can instead use the intuitive interface provided by h5py.
Store an array
import h5py, numpy as np
arr = np.random.randint(0, 10, (1000, 1000))
f = h5py.File('file.h5', 'w', libver='latest') # use 'latest' for performance
dset = f.create_dataset('array', shape=(1000, 1000), data=arr, chunks=(100, 100),
                        compression='gzip', compression_opts=9)
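The snippet above leaves the file handle open so that later steps can reuse it. As a minimal sketch of the same write wrapped in a context manager, followed by a read-back check, something like the following should work. The temporary path and the name restored are assumptions made for illustration only.

```python
import os
import tempfile

import h5py
import numpy as np

arr = np.random.randint(0, 10, (1000, 1000))
# Hypothetical location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "file.h5")

# The context manager closes the file even if an exception is raised.
with h5py.File(path, "w", libver="latest") as f:
    f.create_dataset("array", data=arr, chunks=(100, 100),
                     compression="gzip", compression_opts=9)

# Reopen read-only and load the whole dataset back into memory.
with h5py.File(path, "r") as f:
    restored = f["array"][...]

print(np.array_equal(arr, restored))
```

Closing the file promptly matters because an HDF5 file that is not closed cleanly may be left in an inconsistent state.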
Compression & chunking
There are many compression choices: blosc and lzf, for example, are good choices for compression and decompression performance respectively. Note that gzip is native; other compression filters may not ship by default with your HDF5 installation.
Chunking is another option which, when aligned with how you read data out-of-memory, can significantly improve performance.
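To illustrate what "aligned with how you read" can mean, here is a rough sketch that assumes the data is always read in blocks of complete rows. The chunk shape (10, 1000), the dataset name rows and the temporary path are assumptions chosen for the example, not recommendations.

```python
import os
import tempfile

import h5py
import numpy as np

data = np.arange(1000 * 1000, dtype="int32").reshape(1000, 1000)
path = os.path.join(tempfile.mkdtemp(), "chunked.h5")  # hypothetical path

with h5py.File(path, "w") as f:
    # Each chunk holds 10 complete rows, so a read of rows 20..29
    # maps onto exactly one compressed chunk.
    f.create_dataset("rows", data=data, chunks=(10, 1000), compression="gzip")

with h5py.File(path, "r") as f:
    dset = f["rows"]
    chunk_shape = dset.chunks   # the chunk layout stored in the file
    block = dset[20:30, :]      # only the chunk covering these rows is read

print(chunk_shape, block.shape)
```

If the same file were read column by column instead, every read would touch all 100 chunks, which is why the chunk shape is worth matching to the dominant access pattern.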
Add some attributes
dset.attrs['Description'] = 'Some text snippet'
dset.attrs['RowIndexArray'] = np.arange(1000)
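Attributes can be read back through the same dictionary-like attrs interface. The following is a small self-contained sketch; the file path and the placeholder dataset contents are assumptions for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "attrs.h5")  # hypothetical path

with h5py.File(path, "w") as f:
    dset = f.create_dataset("array", data=np.zeros((1000, 10)))
    dset.attrs["Description"] = "Some text snippet"
    dset.attrs["RowIndexArray"] = np.arange(1000)

with h5py.File(path, "r") as f:
    attrs = f["array"].attrs
    names = sorted(attrs.keys())         # attribute names stored on the dataset
    description = attrs["Description"]   # returned as a Python string
    row_index = attrs["RowIndexArray"]   # returned as a NumPy array

print(names)
```

Attributes are intended for small pieces of metadata; larger arrays are usually better stored as datasets of their own.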
Store a dictionary
# d is assumed to be a dict whose values are arrays (or other data h5py can store)
for k, v in d.items():
    f.create_dataset('dictgroup/' + str(k), data=v)
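The loop above can be wrapped into a pair of helpers for a round trip. This is only a sketch: save_dict and load_dict are hypothetical names, not part of h5py, and it assumes every value can be converted to a NumPy array. Note that keys come back as strings, since HDF5 object names are strings.

```python
import os
import tempfile

import h5py
import numpy as np


def save_dict(h5file, group_name, d):
    """Store each value of d as a dataset named str(key) inside group_name."""
    group = h5file.require_group(group_name)
    for k, v in d.items():
        group.create_dataset(str(k), data=v)


def load_dict(h5file, group_name):
    """Read every dataset in group_name fully into memory."""
    return {name: ds[()] for name, ds in h5file[group_name].items()}


d = {"a": np.arange(5), "b": np.ones((2, 3)), 3: np.array([1.5, 2.5])}
path = os.path.join(tempfile.mkdtemp(), "dict.h5")  # hypothetical path

with h5py.File(path, "w") as f:
    save_dict(f, "dictgroup", d)

with h5py.File(path, "r") as f:
    loaded = load_dict(f, "dictgroup")

print(sorted(loaded))
```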
Out-of-memory access
dictionary = f['dictgroup']
res = dictionary['my_key']  # a Dataset handle; data is read from disk only when you slice it, e.g. res[:10]
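To make the out-of-memory behaviour concrete, the sketch below shows that indexing a group returns a dataset handle, and that only a slice of it is pulled into memory as a NumPy array. The file path, array size and chunk shape are assumptions for the example.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "lazy.h5")  # hypothetical path

with h5py.File(path, "w") as f:
    big = np.arange(1_000_000).reshape(1000, 1000)
    f.create_dataset("dictgroup/my_key", data=big, chunks=(100, 100))

with h5py.File(path, "r") as f:
    res = f["dictgroup"]["my_key"]
    is_handle = isinstance(res, h5py.Dataset)  # a handle, not the data itself
    shape = res.shape                          # metadata, no array data loaded
    corner = res[:5, :5]                       # slicing reads just this region
    is_array = isinstance(corner, np.ndarray)

print(shape, corner.shape)
```

The handle is only valid while the file is open, so any slices you need should be taken inside the with block.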
There is no substitute for reading the documentation for h5py, which exposes most of the C API, but you should see from the above that there is a significant amount of flexibility.