C++ randomly sample k numbers from range 0:n-1 (n > k) without replacement

Here's an approach that doesn't require generating and shuffling a huge list, in case N is huge but k is not:

#include <algorithm>
#include <random>
#include <unordered_set>
#include <vector>

std::unordered_set<int> pickSet(int N, int k, std::mt19937& gen); // defined below

std::vector<int> pick(int N, int k) {
    std::random_device rd;
    std::mt19937 gen(rd());

    std::unordered_set<int> elems = pickSet(N, k, gen);

    // ok, now we have a set of k elements, but it's in an
    // unspecified deterministic order, so we have to shuffle it:

    std::vector<int> result(elems.begin(), elems.end());
    std::shuffle(result.begin(), result.end(), gen);
    return result;
}

Now the naive approach of implementing pickSet is:

std::unordered_set<int> pickSet(int N, int k, std::mt19937& gen)
{
    std::uniform_int_distribution<> dis(0, N - 1); // the question asks for [0, N-1]
    std::unordered_set<int> elems;

    while (elems.size() < static_cast<std::size_t>(k)) {
        elems.insert(dis(gen));
    }

    return elems;
}

But if k is large relative to N, this approach suffers many collisions and can be quite slow. We can do better by guaranteeing that each iteration adds exactly one element (an algorithm due to Robert Floyd):

std::unordered_set<int> pickSet(int N, int k, std::mt19937& gen)
{
    std::unordered_set<int> elems;
    for (int r = N - k; r < N; ++r) {
        int v = std::uniform_int_distribution<>(0, r)(gen);

        // There are two cases:
        // v is not yet in elems ==> add it.
        // v is already in elems ==> add r instead; r can't be in elems,
        // because this is the first iteration in which we could have
        // picked a value that large.

        if (!elems.insert(v).second) {
            elems.insert(r);
        }
    }
    return elems;
}

Bob Floyd devised a random sampling algorithm that uses a set; the intermediate structure is proportional to the sample size you want to take.

It works by generating K random numbers and adding them to a set. If a generated number already exists in the set, it adds the current value of a loop counter instead, which is guaranteed not to have been seen yet. The algorithm therefore runs in time linear in the sample size, needs no large intermediate structure, and still has good distribution properties.

This code is basically lifted from Programming Pearls with some modifications to use more modern C++.

#include <random>
#include <unordered_set>

std::unordered_set<int> BobFloydAlgo(int sampleSize, int rangeUpperBound)
{
    std::unordered_set<int> sample;
    // seed the engine; a default-constructed engine produces the same sequence every run
    std::default_random_engine generator(std::random_device{}());

    for (int d = rangeUpperBound - sampleSize; d < rangeUpperBound; d++)
    {
        int t = std::uniform_int_distribution<>(0, d)(generator);
        if (sample.find(t) == sample.end())
            sample.insert(t);
        else
            sample.insert(d);
    }
    return sample;
}

This code has not been tested.


Starting from C++17, there's a standard function for that: std::sample in the <algorithm> header. It is guaranteed to have linear time complexity.

Sample (pun intended) usage:

#include <algorithm>
#include <iostream>
#include <iterator>
#include <random>
#include <vector>

int main()
{
    std::vector<int> population {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    std::vector<int> sample;
    std::sample(population.begin(), population.end(), 
                std::back_inserter(sample),
                5,
                std::mt19937{std::random_device{}()});
    for (int i : sample)
        std::cout << i << " "; // prints 5 randomly chosen values from the population vector
    std::cout << "\n";
}

Tags:

C++

Random