How to use shared memory instead of passing objects via pickling between multiple processes
Since Python 3.8 there is multiprocessing.shared_memory, which enables direct memory sharing between processes, similar to "real" multi-threading in C or Java. Direct memory sharing can be significantly faster than sharing via files, sockets, or copying data through serialization/deserialization.
It works by providing a shared memory buffer via the SharedMemory class that different processes can claim and declare variables on. More advanced memory buffer management is supported via the SharedMemoryManager class. Variables of basic Python data types can be conveniently declared using the built-in ShareableList. Variables of advanced data types such as numpy.ndarray can be shared by specifying the memory buffer when declaring them.
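As a quick illustration of ShareableList and SharedMemoryManager (a minimal sketch; the list values and buffer size are just placeholders):
from multiprocessing.managers import SharedMemoryManager
from multiprocessing.shared_memory import ShareableList

# Standalone ShareableList: another process can attach to it by name
sl = ShareableList([1, 2.5, "hello"])       # backed by shared memory
attached = ShareableList(name=sl.shm.name)  # attach from elsewhere by name
attached[0] = 42                            # visible through both views
attached.shm.close()
sl.shm.close()
sl.shm.unlink()                             # free the memory when done

# SharedMemoryManager cleans up every block it created on exit
with SharedMemoryManager() as smm:
    buf = smm.SharedMemory(size=1024)          # raw shared buffer
    managed_sl = smm.ShareableList(range(10))  # managed ShareableList
    # ... pass buf.name / managed_sl.shm.name to worker processes ...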
Example with numpy.ndarray
:
import numpy as np
from multiprocessing import shared_memory  # the submodule must be imported explicitly

data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
d_shape = (len(data),)
d_type = np.int64
d_size = np.dtype(d_type).itemsize * int(np.prod(d_shape))

# In the main process
# allocate new shared memory
shm = shared_memory.SharedMemory(create=True, size=d_size)
shm_name = shm.name
# numpy array backed by the shared memory buffer
a = np.ndarray(shape=d_shape, dtype=d_type, buffer=shm.buf)
# copy data into the shared ndarray once
a[:] = data[:]

# In other processes
# attach to the existing shared memory by name
ex_shm = shared_memory.SharedMemory(name=shm_name)
# numpy array on the existing buffer; a and b read/write the same memory
b = np.ndarray(shape=d_shape, dtype=d_type, buffer=ex_shm.buf)
ex_shm.close()  # close after use

# In the main process
shm.close()   # close after use
shm.unlink()  # free the memory
In the above code, the a and b arrays use the same underlying memory and can update it directly, which can be very useful in machine learning. However, you should beware of concurrent-update problems and decide how to handle them: use a Lock (as sketched below), partition the accesses, or accept stochastic updates (the so-called HogWild! style).
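For instance, here is a minimal sketch of guarding the shared array with a multiprocessing.Lock (the worker function, sizes, and the increment operation are illustrative assumptions):
import numpy as np
from multiprocessing import Process, Lock
from multiprocessing import shared_memory

def worker(shm_name, shape, dtype, lock):
    ex_shm = shared_memory.SharedMemory(name=shm_name)
    arr = np.ndarray(shape=shape, dtype=dtype, buffer=ex_shm.buf)
    with lock:   # serialize updates to avoid lost writes
        arr += 1
    ex_shm.close()

if __name__ == "__main__":
    shm = shared_memory.SharedMemory(create=True, size=8 * 10)
    a = np.ndarray((10,), dtype=np.int64, buffer=shm.buf)
    a[:] = 0
    lock = Lock()
    ps = [Process(target=worker, args=(shm.name, a.shape, a.dtype, lock)) for _ in range(4)]
    for p in ps:
        p.start()
    for p in ps:
        p.join()
    print(a)  # all 4s with the lock; without it, concurrent increments could be lost
    shm.close()
    shm.unlink()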
Use files!
No, really, use files -- they are efficient (the OS will cache the content), and they allow you to work on much larger problems (the data set doesn't have to fit into RAM).
Use any of https://docs.scipy.org/doc/numpy-1.15.0/reference/routines.io.html to dump/load numpy arrays to/from files and only pass file names between the processes.
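For example, a minimal sketch using np.save/np.load with a process pool (the worker function and file naming are illustrative assumptions):
import numpy as np
from multiprocessing import Pool

def worker(path):
    arr = np.load(path)      # the OS page cache makes repeated loads cheap
    return float(arr.sum())  # return only a small result, not the array

if __name__ == "__main__":
    paths = []
    for i in range(4):
        path = f"chunk_{i}.npy"
        np.save(path, np.arange(1_000_000, dtype=np.int64))
        paths.append(path)
    with Pool(processes=4) as pool:
        print(pool.map(worker, paths))  # only file names cross the process boundary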
P.S. Benchmark serialisation methods: depending on the intermediate array size, the fastest could be "raw" (no conversion overhead), "compressed" (if the file ends up being written to disk), or something else. IIRC, loading "raw" files may require knowing the data format (dimensions, sizes) in advance.
Check out the Ray project, a distributed execution framework that uses Apache Arrow for serialization. It is especially well suited to numpy arrays, which makes it a good fit for ML workflows.
Here's a snippet from the docs on object serialization:
In Ray, we optimize for numpy arrays by using the Apache Arrow data format. When we deserialize a list of numpy arrays from the object store, we still create a Python list of numpy array objects. However, rather than copy each numpy array, each numpy array object holds a pointer to the relevant array held in shared memory. There are some advantages to this form of serialization.
- Deserialization can be very fast.
- Memory is shared between processes so worker processes can all read the same data without having to copy it.
In my opinion it's even easier to use than the multiprocessing library for parallel execution, especially when you want shared memory; there's an intro to usage in the tutorial.
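As a minimal sketch of the idea (assuming Ray is installed; the function and array are placeholders), a large array is put into the object store once and the workers read it without copying:
import numpy as np
import ray

ray.init()

@ray.remote
def column_sum(arr):
    # arr is a read-only view backed by shared memory, not a per-worker copy
    return arr.sum(axis=0)

big = np.random.rand(10_000, 100)
big_ref = ray.put(big)  # store the array once in the object store
results = ray.get([column_sum.remote(big_ref) for _ in range(4)])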