Generate random numbers with a given (numerical) distribution
An advantage to generating the list using CDF is that you can use binary search. While you need O(n) time and space for preprocessing, you can get k numbers in O(k log n). Since normal Python lists are inefficient, you can use array
module.
If you insist on constant space, you can do the following; O(n) time, O(1) space.
def random_distr(l):
r = random.uniform(0, 1)
s = 0
for item, prob in l:
s += prob
if s >= r:
return item
return item # Might occur because of floating point inaccuracies
(OK, I know you are asking for shrink-wrap, but maybe those home-grown solutions just weren't succinct enough for your liking. :-)
pdf = [(1, 0.1), (2, 0.05), (3, 0.05), (4, 0.2), (5, 0.4), (6, 0.2)]
cdf = [(i, sum(p for j,p in pdf if j < i)) for i,_ in pdf]
R = max(i for r in [random.random()] for i,c in cdf if c <= r)
I pseudo-confirmed that this works by eyeballing the output of this expression:
sorted(max(i for r in [random.random()] for i,c in cdf if c <= r)
for _ in range(1000))
Since Python 3.6, there's a solution for this in Python's standard library, namely random.choices
.
Example usage: let's set up a population and weights matching those in the OP's question:
>>> from random import choices
>>> population = [1, 2, 3, 4, 5, 6]
>>> weights = [0.1, 0.05, 0.05, 0.2, 0.4, 0.2]
Now choices(population, weights)
generates a single sample, contained in a list of length 1:
>>> choices(population, weights)
[4]
The optional keyword-only argument k
allows one to request more than one sample at once. This is valuable because there's some preparatory work that random.choices
has to do every time it's called, prior to generating any samples; by generating many samples at once, we only have to do that preparatory work once. Here we generate a million samples, and use collections.Counter
to check that the distribution we get roughly matches the weights we gave.
>>> million_samples = choices(population, weights, k=10**6)
>>> from collections import Counter
>>> Counter(million_samples)
Counter({5: 399616, 6: 200387, 4: 200117, 1: 99636, 3: 50219, 2: 50025})
scipy.stats.rv_discrete
might be what you want. You can supply your probabilities via the values
parameter. You can then use the rvs()
method of the distribution object to generate random numbers.
As pointed out by Eugene Pakhomov in the comments, you can also pass a p
keyword parameter to numpy.random.choice()
, e.g.
numpy.random.choice(numpy.arange(1, 7), p=[0.1, 0.05, 0.05, 0.2, 0.4, 0.2])
If you are using Python 3.6 or above, you can use random.choices()
from the standard library – see the answer by Mark Dickinson.