Combining HDF5 files
This is actually one of the use cases of HDF5. If you just want to be able to access all the datasets from a single file, and don't care how they're actually stored on disk, you can use external links. From the HDF5 website:
External links allow a group to include objects in another HDF5 file and enable the library to access those objects as if they are in the current file. In this manner, a group may appear to directly contain datasets, named datatypes, and even groups that are actually in a different file. This feature is implemented via a suite of functions that create and manage the links, define and retrieve paths to external objects, and interpret link names:
Here's how to do it in h5py:
import h5py

myfile = h5py.File('foo.hdf5', 'a')
myfile['ext link'] = h5py.ExternalLink("otherfile.hdf5", "/path/to/resource")
Be careful: when opening myfile, you should open it with 'a' if it is an existing file. If you open it with 'w', it will erase its contents.
This would be much faster than copying all the datasets into a new file. I don't know how fast access to otherfile.hdf5 would be, but operating on all the datasets would be transparent - that is, h5py would see all the datasets as residing in foo.hdf5.
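To make the approach concrete, here is a minimal sketch of combining several part files into one container via external links. The file names (sensor_*.h5, combined.h5) and the dataset path "/data" are invented for the example; substitute your own.

```python
import h5py
import numpy as np

# Hypothetical setup: three part files, each holding a dataset at "/data".
parts = ["sensor_0.h5", "sensor_1.h5", "sensor_2.h5"]
for i, fname in enumerate(parts):
    with h5py.File(fname, "w") as f:
        f["data"] = np.arange(5) + i * 100

# Build a fresh container file with one external link per part file.
# (Use 'a' instead of 'w' if you are adding links to an existing file.)
with h5py.File("combined.h5", "w") as master:
    for i, fname in enumerate(parts):
        # The link appears as a member of combined.h5, but the actual
        # bytes stay in the original part file.
        master[f"part{i}"] = h5py.ExternalLink(fname, "/data")

# Access is transparent: h5py opens the linked file on demand.
with h5py.File("combined.h5", "r") as master:
    print(master["part1"][0])  # prints 100
```

Note that the part files must remain at the stored paths for the links to resolve when combined.h5 is read later.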
I found a non-Python solution using h5copy from the official HDF5 tools. h5copy can copy individual specified datasets from one HDF5 file into another existing HDF5 file.
If someone finds a Python/h5py-based solution I would be glad to hear about it.
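For what it's worth, h5py's high-level Group.copy method can do the same job as h5copy, and the destination group may live in a different file. A minimal sketch (the file names a.h5/b.h5 and the object names x, g, g_copy are placeholders):

```python
import h5py

# Build a small source file with a dataset and a group to copy.
with h5py.File("a.h5", "w") as src:
    src["x"] = 1
    src.create_group("g")["y"] = 2

# Copy individual objects into another file, h5copy-style.
with h5py.File("a.h5", "r") as src, h5py.File("b.h5", "w") as dst:
    src.copy("x", dst)                 # copies /x -> /x in b.h5
    src.copy("g", dst, name="g_copy")  # copies group /g -> /g_copy
```

Group.copy copies recursively, so copying a group brings along everything inside it.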
One solution is to use the h5py interface to the low-level H5Ocopy function of the HDF5 API, in particular the h5py.h5o.copy function:
In [1]: import h5py as h5
In [2]: hf1 = h5.File("f1.h5")
In [3]: hf2 = h5.File("f2.h5")
In [4]: hf1.create_dataset("val", data=35)
Out[4]: <HDF5 dataset "val": shape (), type "<i8">
In [5]: hf1.create_group("g1")
Out[5]: <HDF5 group "/g1" (0 members)>
In [6]: hf1.get("g1").create_dataset("val2", data="Thing")
Out[6]: <HDF5 dataset "val2": shape (), type "|O8">
In [7]: hf1.flush()
In [8]: h5.h5o.copy(hf1.id, "g1", hf2.id, "newg1")
In [9]: h5.h5o.copy(hf1.id, "val", hf2.id, "newval")
In [10]: hf2.values()
Out[10]: [<HDF5 group "/newg1" (1 members)>, <HDF5 dataset "newval": shape (), type "<i8">]
In [11]: hf2.get("newval").value
Out[11]: 35
In [12]: hf2.get("newg1").values()
Out[12]: [<HDF5 dataset "val2": shape (), type "|O8">]
In [13]: hf2.get("newg1").get("val2").value
Out[13]: 'Thing'
The above was generated with h5py version 2.0.1-2+b1 and IPython version 0.13.1-2+deb7u1 atop Python version 2.7.3-4+deb7u1 from a more-or-less vanilla install of Debian Wheezy. The files f1.h5 and f2.h5 did not exist prior to executing the above. Note that, per salotz, for Python 3 the dataset/group names need to be bytes (e.g., b"val"), not str.
The hf1.flush() in command [7] is crucial, as the low-level interface apparently will always draw from the version of the .h5 file stored on disk, not that cached in memory. Copying datasets to/from groups not at the root of a File can be achieved by supplying the ID of that group using, e.g., hf1.get("g1").id.
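A short sketch of that group-to-group case, using bytes names as required on Python 3; the file and object names (src.h5, dst.h5, g1, target, val2) are made up for the example:

```python
import h5py

with h5py.File("src.h5", "w") as src, h5py.File("dst.h5", "w") as dst:
    grp = src.create_group("g1")
    grp["val2"] = 7
    dst.create_group("target")
    src.flush()  # low-level copy reads the on-disk file, so flush first

    # Copy /g1/val2 into /target/val2 by passing the groups' low-level
    # identifiers instead of the root file IDs.
    h5py.h5o.copy(src["g1"].id, b"val2", dst["target"].id, b"val2")
```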
Note that h5py.h5o.copy will fail with an exception (no clobber) if an object of the indicated name already exists in the destination location.
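If the destination name may already be taken, one way around the no-clobber behavior is to unlink the existing object before copying. A sketch with invented file and dataset names (s.h5, d.h5, val):

```python
import h5py

with h5py.File("s.h5", "w") as src, h5py.File("d.h5", "w") as dst:
    src["val"] = 1
    src.flush()  # ensure the on-disk file is current before low-level copy
    h5py.h5o.copy(src.id, b"val", dst.id, b"val")

    # A second copy to the same name would raise, so unlink it first.
    if "val" in dst:
        del dst["val"]
    h5py.h5o.copy(src.id, b"val", dst.id, b"val")
```

Deleting the link does not reclaim the file space in HDF5, so for repeated overwrites it may be cleaner to copy into a fresh file.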