Flushing numpy memmap to npy file
Is there a way to infer the shape of the stored array?
No. As far as np.memmap is concerned, the file is just a buffer: it stores the contents of the array, but not the dimensions, dtype, etc. There's no way to infer that information unless it's somehow contained within the array itself. If you've already created an np.memmap backed by a simple binary file, then you would need to write its contents to a new .npy file on disk.
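To make the "just a buffer" point concrete, here is a small sketch. The scratch path and array sizes are made up for illustration. It shows that the file holds only the data bytes, and that the same bytes can be reopened under a completely different dtype and shape without complaint:

```python
import os
import tempfile
import numpy as np

# hypothetical scratch path, used only for this demonstration
path = os.path.join(tempfile.mkdtemp(), 'raw.mm')

a = np.memmap(path, mode='w+', dtype=np.float64, shape=(3, 4))
a[:] = np.arange(12).reshape(3, 4)
a.flush()

# the file holds exactly the data bytes (3 * 4 * 8) and nothing else
raw_size = os.path.getsize(path)

# the same bytes can be reopened under another dtype or shape;
# nothing in the file says which interpretation is the right one
b = np.memmap(path, mode='r', dtype=np.uint8)
c = np.memmap(path, mode='r', dtype=np.float64, shape=(2, 6))
```

Both b and c open without error, which is why the shape and dtype have to be recorded somewhere else.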
You could avoid generating a copy in memory by opening the new .npy file as another memory-mapped array using numpy.lib.format.open_memmap:
import numpy as np
from numpy.lib.format import open_memmap
# a 10GB memory-mapped array
x = np.memmap('/tmp/x.mm', mode='w+', dtype=np.ubyte, shape=(int(1E10),))
# create a memory-mapped .npy file with the same dimensions and dtype
y = open_memmap('/tmp/y.npy', mode='w+', dtype=x.dtype, shape=x.shape)
# copy the array contents
y[:] = x[:]
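If you would rather copy in bounded pieces (for example to flush periodically or report progress), a chunked variant might look like the sketch below. The helper name copy_in_chunks and its chunk_rows parameter are my own, not part of numpy, and the demo uses a tiny array in a temporary directory rather than the 10GB one:

```python
import os
import tempfile
import numpy as np
from numpy.lib.format import open_memmap

def copy_in_chunks(src, dst, chunk_rows=1_000_000):
    """Copy src into dst along the first axis, chunk_rows rows at a time.

    Hypothetical helper: assumes src and dst have the same shape and dtype.
    """
    for start in range(0, src.shape[0], chunk_rows):
        stop = min(start + chunk_rows, src.shape[0])
        dst[start:stop] = src[start:stop]
    dst.flush()  # push any pending writes out to the .npy file

# small stand-in arrays in a scratch directory (paths are illustrative)
tmp = tempfile.mkdtemp()
x_path = os.path.join(tmp, 'x.mm')
y_path = os.path.join(tmp, 'y.npy')

x = np.memmap(x_path, mode='w+', dtype=np.ubyte, shape=(1000,))
x[:] = np.arange(1000) % 256
y = open_memmap(y_path, mode='w+', dtype=x.dtype, shape=x.shape)

copy_in_chunks(x, y, chunk_rows=128)
```

The resulting y.npy can then be read back with np.load like any other .npy file.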
An array saved with np.save is essentially the raw array buffer preceded by a header specifying dtype, shape, and element order. You can read more about it in the numpy documentation.
When you create your np.memmap, you can reserve space for that header with the offset parameter. The numpy documentation specifies that the total header length (magic string, version bytes, length field, and the padded dictionary string) should be a multiple of 64 bytes.
Let's say you reserve 2 * 64 = 128 bytes for the header (more on this below):
import numpy as np
x = np.memmap('/tmp/x.npy', mode='w+', dtype=np.ubyte,
              shape=(int(1E10),), offset=128)
Then, when you are finished manipulating the memmap, you create and write the header, using np.lib.format:
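# flush pending changes, then build the header dict from the array
x.flush()
header = np.lib.format.header_data_from_array_1_0(x)
# open for in-place update ('r+b') so the existing data is not truncated
with open('/tmp/x.npy', 'r+b') as f:
    np.lib.format.write_array_header_1_0(f, header)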
Note that this writes the header at the start of the memmap file, so the serialized header must occupy exactly the 128 bytes you reserved: if it is longer, it will overwrite part of the data, and if it is shorter, np.load will start reading data before the offset. Either way your file will not be readable correctly. The header consists of a fixed-length magic string (6 bytes), two version bytes, two bytes specifying the header length, and a string representation of a dictionary with the keys 'descr', 'fortran_order', and 'shape', padded with spaces and terminated by a newline. If you know the shape and the dtype (descr) of your array, you can easily compute the header length (I fixed it at 128 above, for the sake of simplicity).
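One way to compute that length without counting bytes by hand is to let numpy serialize the header into an in-memory buffer and measure it. This is only a sketch: npy_header_size is a hypothetical helper name of mine, and the round trip uses a small array at a scratch path so it runs quickly:

```python
import io
import os
import tempfile
import numpy as np
from numpy.lib import format as npy_format

def npy_header_size(shape, dtype, fortran_order=False):
    """Return the number of bytes a version 1.0 .npy header occupies.

    Hypothetical helper: serializes the header into memory and measures it.
    """
    d = {
        'descr': npy_format.dtype_to_descr(np.dtype(dtype)),
        'fortran_order': fortran_order,
        'shape': tuple(shape),
    }
    buf = io.BytesIO()
    npy_format.write_array_header_1_0(buf, d)  # magic + version + length + dict
    return buf.tell()

# size the offset for a small array, then check that the file round-trips
shape, dtype = (5, 7), np.float32
offset = npy_header_size(shape, dtype)
path = os.path.join(tempfile.mkdtemp(), 'x.npy')  # illustrative scratch path

x = np.memmap(path, mode='w+', dtype=dtype, shape=shape, offset=offset)
x[:] = np.arange(35).reshape(shape)
x.flush()

with open(path, 'r+b') as f:
    npy_format.write_array_header_1_0(
        f, npy_format.header_data_from_array_1_0(x))

y = np.load(path)
```

Because the offset comes from the same routine that later writes the header, the two should agree as long as the shape, dtype, and numpy version do not change in between.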
After writing the header you can load the data using np.load:
y = np.load('/tmp/x.npy')
If the memmap you saved is large you might want to load the data as a memmap again:
y = np.load('/tmp/x.npy', mmap_mode='r')