best way to preserve numpy arrays on disk

There is now a HDF5 based clone of pickle called hickle!

https://github.com/telegraphic/hickle

import hickle as hkl 

data = {'name': 'test', 'data_arr': [1, 2, 3, 4]}

# Dump data to file
hkl.dump(data, 'new_data_file.hkl')

# Load data from file
data2 = hkl.load('new_data_file.hkl')

print(data == data2)

EDIT:

There also is the possibility to "pickle" directly into a compressed archive by doing:

import pickle, gzip, lzma, bz2

pickle.dump(data, gzip.open('data.pkl.gz', 'wb'))
pickle.dump(data, lzma.open('data.pkl.lzma', 'wb'))
pickle.dump(data, bz2.open('data.pkl.bz2', 'wb'))

compression


Appendix

import numpy as np
import matplotlib.pyplot as plt
import pickle, os, time
import gzip, lzma, bz2, h5py

compressions = ['pickle', 'h5py', 'gzip', 'lzma', 'bz2']
modules = dict(
    pickle=pickle, h5py=h5py, gzip=gzip, lzma=lzma, bz2=bz2
)

labels = ['pickle', 'h5py', 'pickle+gzip', 'pickle+lzma', 'pickle+bz2']
size = 1000

data = {}

# Random data
data['random'] = np.random.random((size, size))

# Not that random data
data['semi-random'] = np.zeros((size, size))
for i in range(size):
    for j in range(size):
        data['semi-random'][i, j] = np.sum(
            data['random'][i, :]) + np.sum(data['random'][:, j]
        )

# Not random data
data['not-random'] = np.arange(
    size * size, dtype=np.float64
).reshape((size, size))

sizes = {}

for key in data:

    sizes[key] = {}

    for compression in compressions:
        path = 'data.pkl.{}'.format(compression)

        if compression == 'pickle':
            time_start = time.time()
            pickle.dump(data[key], open(path, 'wb'))
            time_tot = time.time() - time_start
            sizes[key]['pickle'] = (
                os.path.getsize(path) * 10**-6, 
                time_tot.
            )
            os.remove(path)

        elif compression == 'h5py':
            time_start = time.time()
            with h5py.File(path, 'w') as h5f:
                h5f.create_dataset('data', data=data[key])
            time_tot = time.time() - time_start
            sizes[key][compression] = (os.path.getsize(path) * 10**-6, time_tot)
            os.remove(path)

        else:
            time_start = time.time()
            with modules[compression].open(path, 'wb') as fout:
                pickle.dump(data[key], fout)
            time_tot = time.time() - time_start
            sizes[key][labels[compressions.index(compression)]] = (
                os.path.getsize(path) * 10**-6, 
                time_tot,
            )
            os.remove(path)


f, ax_size = plt.subplots()
ax_time = ax_size.twinx()

x_ticks = labels
x = np.arange(len(x_ticks))

y_size = {}
y_time = {}
for key in data:
    y_size[key] = [sizes[key][x_ticks[i]][0] for i in x]
    y_time[key] = [sizes[key][x_ticks[i]][1] for i in x]

width = .2
viridis = plt.cm.viridis

p1 = ax_size.bar(x - width, y_size['random'], width, color = viridis(0))
p2 = ax_size.bar(x, y_size['semi-random'], width, color = viridis(.45))
p3 = ax_size.bar(x + width, y_size['not-random'], width, color = viridis(.9))
p4 = ax_time.bar(x - width, y_time['random'], .02, color='red')

ax_time.bar(x, y_time['semi-random'], .02, color='red')
ax_time.bar(x + width, y_time['not-random'], .02, color='red')

ax_size.legend(
    (p1, p2, p3, p4), 
    ('random', 'semi-random', 'not-random', 'saving time'),
    loc='upper center', 
    bbox_to_anchor=(.5, -.1), 
    ncol=4,
)
ax_size.set_xticks(x)
ax_size.set_xticklabels(x_ticks)

f.suptitle('Pickle Compression Comparison')
ax_size.set_ylabel('Size [MB]')
ax_time.set_ylabel('Time [s]')

f.savefig('sizes.pdf', bbox_inches='tight')

I'm a big fan of hdf5 for storing large numpy arrays. There are two options for dealing with hdf5 in python:

http://www.pytables.org/

http://www.h5py.org/

Both are designed to work with numpy arrays efficiently.


savez() save data in a zip file, It may take some time to zip & unzip the file. You can use save() & load() function:

f = file("tmp.bin","wb")
np.save(f,a)
np.save(f,b)
np.save(f,c)
f.close()

f = file("tmp.bin","rb")
aa = np.load(f)
bb = np.load(f)
cc = np.load(f)
f.close()

To save multiple arrays in one file, you just need to open the file first, and then save or load the arrays in sequence.


I've compared performance (space and time) for a number of ways to store numpy arrays. Few of them support multiple arrays per file, but perhaps it's useful anyway.

benchmark for numpy array storage

Npy and binary files are both really fast and small for dense data. If the data is sparse or very structured, you might want to use npz with compression, which'll save a lot of space but cost some load time.

If portability is an issue, binary is better than npy. If human readability is important, then you'll have to sacrifice a lot of performance, but it can be achieved fairly well using csv (which is also very portable of course).

More details and the code are available at the github repo.