How to choose keys from a python dictionary based on weighted probability?
Starting in Python 3.6, you can use the built-in random.choices()
instead of having to use Numpy.
So then, if we want to sample (with replacement) 25 keys from your dictionary where the values are the weights/probabilities of being sampled, we can simply write:
import random
random.choices(list(my_dict.keys()), weights=my_dict.values(), k=25)
This outputs a list of sampled keys:
['c', 'b', 'c', 'b', 'b', 'c', 'c', 'c', 'b', 'c', 'b', 'c', 'b', 'c', 'c', 'c', 'c', 'c', 'a', 'b']
If you just want one key, set k
to 1 and extract the single element from the list that random.choices
returns:
random.choices(list(my_dict.keys()), weights=my_dict.values(), k=1)[0]
(If you don't convert my_dict.keys()
to a list, you'll get a TypeError about how it's not subscriptable.)
Here's the relevant snippet from the docs:
random.choices(population, weights=None, *, cum_weights=None, k=1)
Return a k sized list of elements chosen from the population with replacement. If the population is empty, raises IndexError.
If a weights sequence is specified, selections are made according to the relative weights. Alternatively, if a cum_weights sequence is given, the selections are made according to the cumulative weights (perhaps computed using itertools.accumulate()). For example, the relative weights [10, 5, 30, 5] are equivalent to the cumulative weights [10, 15, 45, 50]. Internally, the relative weights are converted to cumulative weights before making selections, so supplying the cumulative weights saves work.
If neither weights nor cum_weights are specified, selections are made with equal probability. If a weights sequence is supplied, it must be the same length as the population sequence. It is a TypeError to specify both weights and cum_weights.
The weights or cum_weights can use any numeric type that interoperates with the float values returned by random() (that includes integers, floats, and fractions but excludes decimals). Weights are assumed to be non-negative.
For a given seed, the choices() function with equal weighting typically produces a different sequence than repeated calls to choice(). The algorithm used by choices() uses floating point arithmetic for internal consistency and speed. The algorithm used by choice() defaults to integer arithmetic with repeated selections to avoid small biases from round-off error.
According to the comments at https://stackoverflow.com/a/39976962/5139284, random.choices
is faster for small arrays, and numpy.random.choice
is faster for big arrays. numpy.random.choice
also provides an option to sample without replacement, whereas there's no built-in Python standard library function for that.
def weighted_random_by_dct(dct):
rand_val = random.random()
total = 0
for k, v in dct.items():
total += v
if rand_val <= total:
return k
assert False, 'unreachable'
Should do the trick. Goes through each key and keeps a running sum and if the random value (between 0 and 1) falls in the slot it returns that key