Seeding random number generators in parallel programs
NumPy 1.17 introduced (quoting the documentation) "..three strategies implemented that can be used to produce repeatable pseudo-random numbers across multiple processes (local or distributed).."
The first strategy uses a SeedSequence object. It offers many parent/child options, but for our case, if you want every process to generate the same random numbers, with different values on each run:
(Python 3, printing 3 random numbers from each of 4 processes)
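from numpy.random import SeedSequence, default_rng
from multiprocessing import Pool
def rng_mp(rng):
    # Draw 3 floats from the generator handed to this worker
    return [rng.random() for _ in range(3)]
if __name__ == "__main__":
    seed_sequence = SeedSequence()  # fresh OS entropy, so each run differs
    n_proc = 4
    # Every generator is built from the same SeedSequence, so all streams match
    with Pool(processes=n_proc) as pool:
        print(pool.map(rng_mp, [default_rng(seed_sequence) for _ in range(n_proc)]))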
# 2 different runs
[[0.2825724770857644, 0.6465318335272593, 0.4620869345284885],
[0.2825724770857644, 0.6465318335272593, 0.4620869345284885],
[0.2825724770857644, 0.6465318335272593, 0.4620869345284885],
[0.2825724770857644, 0.6465318335272593, 0.4620869345284885]]
[[0.04503760429109904, 0.2137916986051025, 0.8947678672387492],
[0.04503760429109904, 0.2137916986051025, 0.8947678672387492],
[0.04503760429109904, 0.2137916986051025, 0.8947678672387492],
[0.04503760429109904, 0.2137916986051025, 0.8947678672387492]]
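If each process should instead get its own stream, SeedSequence.spawn() is one of the parent/child options mentioned above. A minimal sketch, assuming an arbitrary root seed of 12345 and the hypothetical helper names spawn_generators and draw_three (neither comes from the original answer):

```python
from numpy.random import SeedSequence, default_rng


def spawn_generators(root_seed, n):
    # spawn() derives n child SeedSequence objects from the root sequence;
    # each child seeds its own generator.
    children = SeedSequence(root_seed).spawn(n)
    return [default_rng(child) for child in children]


def draw_three(rng):
    # Same job as rng_mp above: draw 3 floats from the given generator.
    return [rng.random() for _ in range(3)]


if __name__ == "__main__":
    # 12345 is an arbitrary root seed chosen for illustration.
    # With a Pool this would be: pool.map(draw_three, spawn_generators(12345, 4))
    for rng in spawn_generators(12345, 4):
        print(draw_three(rng))
```

Because the root seed is fixed, the four lists differ from each other but repeat on every run. Calling SeedSequence() with no argument would give different values per run instead.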
If you want identical results on every run for reproducibility, you can simply reseed NumPy with the same fixed seed (17 here) inside each process:
import numpy as np
from multiprocessing import Pool
def rng_mp(seed):
    # Reseed this worker's global RNG with the fixed seed
    np.random.seed(seed)
    return [np.random.rand() for _ in range(3)]
if __name__ == "__main__":
    n_proc = 4
    with Pool(processes=n_proc) as pool:
        print(pool.map(rng_mp, [17] * n_proc))
# same results each run:
[[0.2946650026871097, 0.5305867556052941, 0.19152078694749486],
[0.2946650026871097, 0.5305867556052941, 0.19152078694749486],
[0.2946650026871097, 0.5305867556052941, 0.19152078694749486],
[0.2946650026871097, 0.5305867556052941, 0.19152078694749486]]
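Reseeding every process with the same value makes runs repeatable, but all workers then return the same numbers. If each worker should get its own repeatable stream, one option is to combine the fixed seed with a worker index. A rough sketch, assuming NumPy 1.17+ and the hypothetical names BASE_SEED, rng_for_worker and draw_three:

```python
from numpy.random import default_rng

BASE_SEED = 17  # same fixed seed as in the example above


def rng_for_worker(worker_id, base_seed=BASE_SEED):
    # default_rng accepts a list of ints as a seed; including the worker
    # index gives each worker its own stream that still repeats across runs.
    return default_rng([worker_id, base_seed])


def draw_three(worker_id):
    rng = rng_for_worker(worker_id)
    return [rng.random() for _ in range(3)]


if __name__ == "__main__":
    # With a Pool this would be: pool.map(draw_three, range(4))
    for worker_id in range(4):
        print(worker_id, draw_three(worker_id))
```

The design choice here is that only integers cross the process boundary, so nothing depends on the state of the global numpy.random generator in the workers.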
Here is a nice blog post that explains how numpy.random works.
If you use np.random.rand(), it takes the seed created when you imported the np.random module. So you need to set a new seed manually in each worker process (see the examples in the blog post).
The Python random module does not have this issue and automatically generates a different seed for each process.
If no seed is provided explicitly, numpy.random seeds itself using an OS-dependent source of randomness. Usually this is /dev/urandom on Unix-based systems (or some Windows equivalent), but if that is not available for some reason, it seeds itself from the wall clock. Because this self-seeding happens once, when the module is first imported, subprocesses forked afterwards inherit a copy of the parent's already-seeded state, so several of them can produce identical random variates.
The number of identical outputs often tracks the number of worker processes you are running. For example:
import numpy as np
from multiprocessing import Pool
def Foo_np(seed=None):
    # Seeding is deliberately left commented out to show the problem
    # np.random.seed(seed)
    return np.random.uniform(0, 1, 5)
if __name__ == "__main__":
    pool = Pool(processes=8)
    print(np.array(pool.map(Foo_np, range(20))))
# [[ 0.14463001 0.80273208 0.5559258 0.55629762 0.78814652] <-
# [ 0.14463001 0.80273208 0.5559258 0.55629762 0.78814652] <-
# [ 0.14463001 0.80273208 0.5559258 0.55629762 0.78814652] <-
# [ 0.14463001 0.80273208 0.5559258 0.55629762 0.78814652] <-
# [ 0.14463001 0.80273208 0.5559258 0.55629762 0.78814652] <-
# [ 0.14463001 0.80273208 0.5559258 0.55629762 0.78814652] <-
# [ 0.14463001 0.80273208 0.5559258 0.55629762 0.78814652] <-
# [ 0.64672339 0.99851749 0.8873984 0.42734339 0.67158796]
# [ 0.64672339 0.99851749 0.8873984 0.42734339 0.67158796]
# [ 0.64672339 0.99851749 0.8873984 0.42734339 0.67158796]
# [ 0.64672339 0.99851749 0.8873984 0.42734339 0.67158796]
# [ 0.64672339 0.99851749 0.8873984 0.42734339 0.67158796]
# [ 0.11283279 0.28180632 0.28365286 0.51190168 0.62864241]
# [ 0.11283279 0.28180632 0.28365286 0.51190168 0.62864241]
# [ 0.28917586 0.40997875 0.06308188 0.71512199 0.47386047]
# [ 0.11283279 0.28180632 0.28365286 0.51190168 0.62864241]
# [ 0.64672339 0.99851749 0.8873984 0.42734339 0.67158796]
# [ 0.11283279 0.28180632 0.28365286 0.51190168 0.62864241]
# [ 0.14463001 0.80273208 0.5559258 0.55629762 0.78814652] <-
# [ 0.11283279 0.28180632 0.28365286 0.51190168 0.62864241]]
You can see that groups of up to 8 worker processes started from the same seed, giving me identical random sequences (I've marked the first group with arrows).
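The duplicated rows are what you get whenever two generators start from the same internal state. A small fork-free sketch of that effect, using RandomState.get_state() and set_state() to imitate subprocesses that inherit the same seed (the helper name clone_state is hypothetical):

```python
import numpy as np


def clone_state(parent):
    # Hypothetical helper: copy the parent's internal state into a fresh
    # RandomState, imitating a subprocess that inherited the same seed.
    child = np.random.RandomState()
    child.set_state(parent.get_state())
    return child


parent = np.random.RandomState()  # self-seeded, no explicit seed given
child_a = clone_state(parent)
child_b = clone_state(parent)

row_a = child_a.uniform(0, 1, 5)
row_b = child_b.uniform(0, 1, 5)
print(np.array_equal(row_a, row_b))  # True: same state, same variates
```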
Calling np.random.seed() within a subprocess forces that process's global RNG instance to seed itself again from /dev/urandom or the wall clock, which will (probably) prevent you from seeing identical output from multiple subprocesses. Best practice is to explicitly pass a different seed (or numpy.random.RandomState instance) to each subprocess, e.g.:
def Foo_np(seed=None):
    # Private RNG for this task, seeded explicitly
    local_state = np.random.RandomState(seed)
    print(local_state.uniform(0, 1, 5))
pool.map(Foo_np, range(20))
I'm not entirely sure what underlies the differences between random and numpy.random in this respect (perhaps random has slightly different rules for selecting a source of randomness to self-seed with?). I would still recommend explicitly passing a seed or a random.Random instance to each subprocess to be on the safe side. In Python 2 you could also use the .jumpahead() method of random.Random, which is designed for shuffling the states of Random instances in multithreaded programs; it was removed in Python 3.
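The explicit-seed recommendation carries over to the standard library directly. A minimal sketch, assuming the hypothetical function name foo_random and arbitrary integer seeds:

```python
import random


def foo_random(seed):
    # Private random.Random instance, so the module-level generator's
    # state (whatever a subprocess inherited) is never used.
    local_rng = random.Random(seed)
    return [local_rng.uniform(0, 1) for _ in range(5)]


if __name__ == "__main__":
    # With a Pool this would be: pool.map(foo_random, range(20))
    for seed in range(3):
        print(seed, foo_random(seed))
```

Each task's output depends only on the seed it was given, so results stay repeatable regardless of which worker picks the task up.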