Weighted random numbers in MATLAB

Amro has a really nice answer for this topic. However, one might want a super-fast implementation to sample from huge PDFs whose domain contains several thousand values. For such scenarios it becomes costly to call bsxfun and cumsum very frequently. Motivated by Gnovice's answer, it makes sense to implement the roulette-wheel algorithm with a run-length encoding scheme. I benchmarked it against Amro's solution:

%% Toy example: generate random numbers from an arbitrary PDF

a = 1:3;                                %# domain of PDF
w = [0.3 0.1 0.2];                      %# Probability Values (Weights)
N = 10000;                              %# Number of random generations

%# Generate using roulette wheel + run-length encoding
factor = 1 / min(w);                    %# scale factor: min(w) maps to one bin
intW = int32(w * factor);               %# replication counts for run-length encoding
idxArr = zeros(1,sum(intW));            %# index access array
idxArr([1 cumsum(intW(1:end-1))+1]) = 1;%# tag indexes where the sample changes
sampTable = a(cumsum(idxArr));          %# lookup table filled with samples
len = size(sampTable,2);

tic;
R = sampTable( uint32(randi([1 len],N,1)) );
toc;
tabulate(R);

Some benchmarks of the code above for very large inputs, where the domain of the PDF is very long:

a ~ 15000, n = 10000
Without table: Elapsed time is 0.006203 seconds.
With table:    Elapsed time is 0.003308 seconds.
ByteSize(sampTable) 796.23 kb

a ~ 15000, n = 100000
Without table: Elapsed time is 0.003510 seconds.
With table:    Elapsed time is 0.002823 seconds.

a ~ 35000, n = 10000
Without table: Elapsed time is 0.226990 seconds.
With table:    Elapsed time is 0.001328 seconds.
ByteSize(sampTable) 2.79 Mb

a ~ 35000, n = 100000
Without table: Elapsed time is 2.784713 seconds.
With table:    Elapsed time is 0.003452 seconds.

a ~ 35000, n = 1000000
Without table: bsxfun: out of memory
With table   : Elapsed time is 0.021093 seconds.

The idea is to build a run-length-encoded lookup table in which high-probability values of the PDF are replicated more often than low-probability ones. We then draw a uniformly distributed index into this weighted sample table and return the corresponding value. Note that int32(w * factor) quantizes the weights, so very skewed weight ratios inflate the table.

It is memory intensive, but with this approach it is possible to scale even to PDFs hundreds of thousands of entries long, while keeping access super-fast.
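For reuse, the table construction and sampling can be packaged together. A minimal sketch, under the same assumptions as the code above (the function name sampleFromTable is mine, purely for illustration):

function R = sampleFromTable(a, w, N)
%SAMPLEFROMTABLE Draw N weighted samples from domain a with weights w,
%   via the run-length-encoded lookup table described above.
factor = 1 / min(w);                     %# min(w) maps to a single bin
intW = int32(w * factor);                %# replication count per value
idxArr = zeros(1,sum(intW));
idxArr([1 cumsum(intW(1:end-1))+1]) = 1; %# mark where the value changes
sampTable = a(cumsum(idxArr));           %# expanded lookup table
R = sampTable(randi(numel(sampTable), N, 1));
end

For repeated sampling with fixed weights, one would instead build sampTable once and index into it directly, so the construction cost is paid only once.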


R = randsample([1 2 3], N, true, [0.3 0.1 0.2])

randsample is included in the Statistics Toolbox.
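If the toolbox may not be available on the target machine, the call can be guarded; a minimal sketch (license('test', ...) is the standard availability check, and the fallback is the roulette-wheel one-liner explained below):

a = [1 2 3];  w = [0.3 0.1 0.2];  N = 10;
if license('test', 'Statistics_Toolbox')
    R = randsample(a, N, true, w);
else
    R = a( sum( bsxfun(@ge, rand(N,1), cumsum(w./sum(w))), 2) + 1 );  %# manual fallback
end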


Otherwise you can use some kind of roulette-wheel selection process. See this similar question (although not MATLAB specific). Here's my one-line implementation:

a = 1:3;             %# possible numbers
w = [0.3 0.1 0.2];   %# corresponding weights
N = 10;              %# how many numbers to generate

R = a( sum( bsxfun(@ge, rand(N,1), cumsum(w./sum(w))), 2) + 1 )

Explanation:

Consider the interval [0,1]. We assign each element in the list (1:3) a sub-interval whose length is proportional to its weight; thus 1 gets an interval of length 0.3/(0.3+0.1+0.2) = 0.5, and similarly for the others.

Now if we generate a random number uniformly distributed over [0,1], any number in [0,1] has an equal probability of being picked; thus the lengths of the sub-intervals determine the probability of the random number falling in each interval.

This matches what the code above does: pick a number X~U[0,1] (or rather N such numbers), then find which interval it falls into, in a vectorized way.
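As a concrete illustration, here is one draw worked through by hand (x = 0.7 is just an example value):

w = [0.3 0.1 0.2];
edges = cumsum(w./sum(w))      %# [0.5000 0.6667 1.0000]
x = 0.7;                       %# one draw from U[0,1]
k = sum(x >= edges) + 1        %# x clears the first two edges, so k = 3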


You can check the results of the two techniques above by generating a large enough sequence (N=1000):

>> tabulate( R )
  Value    Count   Percent
      1      511     51.10%
      2      160     16.00%
      3      329     32.90%

which more or less matches the normalized weights w./sum(w) = [0.5 0.16667 0.33333].


Amro gives a nice answer (that I upvoted), but it will be highly memory intensive if you wish to generate many numbers from a large set, because the bsxfun operation can generate a huge intermediate array that is then summed. For example, suppose I had a set of 10000 values to sample from, all with different weights, and wanted to generate 1000000 numbers from that sample.

This takes a lot of memory, since it generates a 10000x1000000 array internally, with 10^10 elements. Even though it is a logical array, roughly 10 gigabytes of RAM must be allocated.
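A quick back-of-the-envelope check (assuming MATLAB's usual one byte of storage per logical element):

nElems = 10000 * 1000000;   %# 1e10 elements in the intermediate logical array
bytes = nElems * 1;         %# one byte per logical
gib = bytes / 2^30          %# roughly 9.3 GiB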

A better solution is to use histc. Thus...

a = 1:3;
w = [.3 .1 .2];
N = 10;

[~,R] = histc(rand(1,N),cumsum([0;w(:)./sum(w)]));
R = a(R)
R =
     1     1     1     2     2     1     3     1     1     1

However, for a large problem of the size I suggested above, it is fast.

a = 1:10000;
w = rand(1,10000);
N = 1000000;

tic
[~,R] = histc(rand(1,N),cumsum([0;w(:)./sum(w)]));
R = a(R);
toc
Elapsed time is 0.120879 seconds.

Admittedly, my version takes two lines to write: the indexing operation must happen on a second line, since it uses the second output of histc. Also note that I've used a feature of newer MATLAB releases: the tilde (~) as the first output argument of histc. This causes that first output (the bin counts) to be immediately dumped in the bit bucket.
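As an aside, newer MATLAB releases provide discretize, the documented successor to histc for bin-index lookups. A sketch of the equivalent call, assuming the same a, w and N as above:

edges = cumsum([0; w(:)./sum(w)]);
R = a(discretize(rand(1,N), edges));   % bin indices map directly into a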


TL;DR

For maximum performance, if you only need a single sample, use

R = a( sum( (rand(1) >= cumsum(w./sum(w)))) + 1 );

and if you need multiple samples, use

[~, R] = histc(rand(N,1),cumsum([0;w(:)./sum(w)]));
R = a(R);   % map the bin indices back onto the values in a

Avoid randsample. Generating multiple samples upfront is three orders of magnitude faster than generating individual values.


Performance metrics

Since this showed up near the top of my Google search, I wanted to add some performance metrics to show that the right solution depends very much on the value of N and on the requirements of the application, and that changing the design of the application can dramatically increase performance.

For large N, or indeed N > 1:

a = 1:3;             % possible numbers
w = [0.3 0.1 0.2];   % corresponding weights
N = 100000000;       % number of values to generate

w_normalized = w / sum(w)  % normalised weights, for indication

fprintf('randsample:\n');
tic
R = randsample(a, N, true, w);
toc
tabulate(R)

fprintf('bsxfun:\n');
tic
R = a( sum( bsxfun(@ge, rand(N,1), cumsum(w./sum(w))), 2) + 1 );
toc
tabulate(R)

fprintf('histc:\n');
tic
[~, R] = histc(rand(N,1),cumsum([0;w(:)./sum(w)]));
toc
tabulate(R)

Results:

w_normalized =

    0.5000    0.1667    0.3333

randsample:
Elapsed time is 2.976893 seconds.
  Value    Count   Percent
      1    49997864     50.00%
      2    16670394     16.67%
      3    33331742     33.33%
bsxfun:
Elapsed time is 2.712315 seconds.
  Value    Count   Percent
      1    49996820     50.00%
      2    16665005     16.67%
      3    33338175     33.34%
histc:
Elapsed time is 2.078809 seconds.
  Value    Count   Percent
      1    50004044     50.00%
      2    16665508     16.67%
      3    33330448     33.33%

In this case, histc is fastest.

However, in the case where it is not possible to generate all N values up front, perhaps because the weights are updated on each iteration, i.e. N=1:

a = 1:3;             % possible numbers
w = [0.3 0.1 0.2];   % corresponding weights
I = 100000;          % number of values to generate

w_normalized = w / sum(w)  % normalised weights, for indication

R = zeros(I,1);

fprintf('randsample:\n');
tic
for i=1:I
    R(i) = randsample(a, 1, true, w);
end
toc
tabulate(R)

fprintf('cumsum:\n');
tic
for i=1:I
    R(i) = a( sum( (rand(1) >= cumsum(w./sum(w)))) + 1 );
end
toc
tabulate(R)

fprintf('histc:\n');
tic
for i=1:I
    [~, R(i)] = histc(rand(1),cumsum([0;w(:)./sum(w)]));
end
toc
tabulate(R)

Results:

w_normalized =

    0.5000    0.1667    0.3333

randsample:
Elapsed time is 3.526473 seconds.
  Value    Count   Percent
      1    50437     50.44%
      2    16149     16.15%
      3    33414     33.41%
cumsum:
Elapsed time is 0.473207 seconds.
  Value    Count   Percent
      1    50018     50.02%
      2    16748     16.75%
      3    33234     33.23%
histc:
Elapsed time is 1.046981 seconds.
  Value    Count   Percent
      1    50134     50.13%
      2    16684     16.68%
      3    33182     33.18%

In this case, the custom cumsum approach (based on the bsxfun version) is fastest.

In any case, randsample certainly looks like a bad choice all round. It also goes to show that if an algorithm can be arranged to generate all random variables upfront, it will perform much better (note that three orders of magnitude fewer values are generated in the N=1 case, in a similar execution time).
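If the weights really are fixed across iterations, the loop can be restructured to draw everything upfront; a minimal sketch of that design change, using the same a, w and I as in the benchmark above:

edges = cumsum([0; w(:)./sum(w)]);
[~, Rall] = histc(rand(I,1), edges);   % one vectorized draw for all iterations
for i = 1:I
    r = a(Rall(i));                    % i-th precomputed sample
    % ... per-iteration work that consumes r ...
end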

