save numpy array in append mode
I made a library to create Numpy .npy
files that are larger than the main memory of the machine by appending on the zero axis. The file can then be read with mmap_mode="r"
.
https://pypi.org/project/npy-append-array
Installation
conda install -c conda-forge npy-append-array
or
pip install npy-append-array
Example
from npy_append_array import NpyAppendArray
import numpy as np
arr1 = np.array([[1,2],[3,4]])
arr2 = np.array([[1,2],[3,4],[5,6]])
filename = 'out.npy'
with NpyAppendArray(filename) as npaa:
npaa.append(arr1)
npaa.append(arr2)
npaa.append(arr2)
data = np.load(filename, mmap_mode="r")
print(data)
Implementation Details
Appending to an array created by np.save might be possible under certain circumstances, since the .npy total header byte count is required to be evenly divisible by 64. Thus, there might be some spare space to grow the shape entry in the array descriptor. However, this is not guaranteed and might randomly fail. Initialize the array with NpyAppendArray(filename) directly (see above) so the header will be created with 64 byte of spare header space for growth.
Will 64 byte extra header space cover my needs?
It allows for up to 10^64 >= 2^212 array entries or data bits. Indeed, this is less than the number of atoms in the universe. However, fully populating such an array, due to limits imposed by quantum mechanics, would require more energy than would be needed to boil the oceans, compare
https://hbfs.wordpress.com/2009/02/10/to-boil-the-oceans
Therefore, a wide range of use cases should be coverable with this approach.
The build-in .npy
file format is perfectly fine for working with small datasets, without relying on external modules other then numpy
.
However, when you start having large amounts of data, the use of a file format, such as HDF5, designed to handle such datasets, is to be preferred [1].
For instance, below is a solution to save numpy
arrays in HDF5 with PyTables,
Step 1: Create an extendable EArray
storage
import tables
import numpy as np
filename = 'outarray.h5'
ROW_SIZE = 100
NUM_COLUMNS = 200
f = tables.open_file(filename, mode='w')
atom = tables.Float64Atom()
array_c = f.create_earray(f.root, 'data', atom, (0, ROW_SIZE))
for idx in range(NUM_COLUMNS):
x = np.random.rand(1, ROW_SIZE)
array_c.append(x)
f.close()
Step 2: Append rows to an existing dataset (if needed)
f = tables.open_file(filename, mode='a')
f.root.data.append(x)
Step 3: Read back a subset of the data
f = tables.open_file(filename, mode='r')
print(f.root.data[1:10,2:20]) # e.g. read from disk only this part of the dataset