Efficiently picking combinations of Integers

Great question! This comes up in survey design, where you want a few different versions of the survey that each only contain a subset of the questions, but you want every pair (or t-tuple) of questions to have been asked at least once.

This is called covering design, and is a variant of the classic set cover problem. As you can read in an excellent Mathematics Stack Exchange post on the topic, folks use notation C(v, k, t) indicating the minimum number of k-element subsets you need to draw (k=3 in your case) from a v-element set (v=5 in your case) such that every t-element subset in the entire set (t=2 in your case) is contained within one of your selected subsets. Folks have evaluated this function for many different (v, k, t) tuples; see, for instance, https://ljcr.dmgordon.org/cover/table.html . We can read from that table that C(5, 3, 2) = 4, with the following as one possible design:

  1  2  3
  1  4  5
  2  3  4
  2  3  5

First and foremost, this problem is NP-hard, so all known exact algorithms will scale exponentially in inputs v, k, and t. So while you may be able to solve small instances exactly by enumeration or some more clever exact method (e.g. integer programming), you will likely need to resort to heuristic methods as the problem size gets very large.

One possibility in this direction is lexicographic covering, as proposed in https://arxiv.org/pdf/math/9502238.pdf (you will note that many of the solutions on the site linked above list "lex covering" as the method of construction). Basically you list out all possible k-tuples in lexicographic order:

123
124
125
134
135
145
234
235
245
345

Then you greedily add the k-tuple that covers the most previously uncovered t-tuples, breaking ties using the lexicographic ordering.

Here's how the algorithm works in our case:

  1. At the beginning every 3-tuple covers 3 different 2-tuples, so we add 123 since it is lexicographically earliest.

  2. After doing this, the 2-tuples of 12, 13, and 23 have been covered, while all remaining 2-tuples are not covered. A number of 3-tuples cover 3 more 2-tuples, e.g. 145 and 245. We pick 145, since it is lexicographically first, covering 14, 45, and 15.

  3. Now we have 4 remaining uncovered 2-tuples -- 24, 25, 34, and 35. No 3-tuple covers 3 of these, but several cover 2, e.g. 234 and 345. We select 234 as the lexicographically earliest.

  4. We have two remaining uncovered 2-tuples -- 25 and 35. We select 235 as the only 3-tuple that covers both.

We end up with the exact solution shown above. Importantly, this is just a heuristic method -- it doesn't give any guarantee that 4 is the smallest number of 3-tuples needed to cover all pairs in a set with 5 elements. In this case, a lower bound by Schönheim (a reference is provided in the linked article above) convinces us that, in fact, C(5, 3, 2) cannot be smaller than 4. We conclude that the solution from lexicographic covering is in fact optimal.

You would need a tweak to cover each t-tuple a certain number of times r. One obvious one would just be to repeat each tuple to be covered "r" times, and then run lex covering as usual (so for instance in the first step above each 3-tuple would cover 9 2-tuples with r=3). Of course this remains a heuristic for your overall problem due to the use of lex covering.


Here is an option using data.table to keep track of the matrix count and RcppAlgos to generate the combinations:

library(RcppAlgos)
library(data.table)

M <- 100 #5 #10 #100
sz <- 5 #3 #4 5 
minpick <- 3 #1 #2
d <- integer(M)

system.time({
    universe <- as.data.table(comboGeneral(M, 2L, nThreads=4L))[, count := 0L]
    ntuples <- 0
    while (universe[, any(count < minpick)]) {
        v <- universe[order(count), head(unique(c(V1[1L:2L], V2[1L:2L])), sz)]
        universe[as.data.table(comboGeneral(v, 2L, nThreads=4L)), on=.NATURAL, count := count + 1L]
        ntuples = ntuples + 1L
    }
    ntuples
})
#   user  system elapsed 
#  26.82    9.81   28.75 

m <- matrix(0L, nrow=M, ncol=M)
m[as.matrix(universe[, V1:V2])] <- universe$count
m + t(m) + diag(d)

It is a greedy algorithm, hence, I am not sure if this will result a minimum number of tuples.


Since this question asks for algorithmic approaches to covering designs, I'll provide one that gives exact answers (aka the best possible design) using integer programming in R. For every single k-tuple that you are considering (k=3 for you, since you are selecting triplets), define a decision variable that takes value 1 if you include it in your design and 0 if not. So in your case, you would define x_123 to indicate if tuple (1,2,3) is selected, x_345 for (3,4,5), and so on.

The objective of the optimization model is to minimize the number of tuples selected, aka the sum of all your decision variables. However, for every t-tuple (t=2 in your case), you need to include a decision variable that contains that t-tuple. This yields a constraint for every t-tuple. As an example, we would have x_123+x_124+x_125 >= 1 would be the constraint that requires the pair 12 to be in some selected tuple.

This yields the following optimization model:

min  x_123+x_124+...+x_345
s.t. x_123+x_124+x_125 >= 1  # constraint for 12
     x_123+x_134+x_135 >= 1  # constraint for 13
     ...
     x_145+x_245+x_345 >= 1  # constraint for 45
     x_ijk binary for all i, j, k

You could extend this to requiring r repeats of every t-tuple by changing the right-hand side of every inequality to "r" and requiring all the variables to be integer instead of binary.

This is easy to solve with a package like lpSolve in R:

library(lpSolve)
C <- function(v, k, tt, r) {
  k.tuples <- combn(v, k)
  t.tuples <- combn(v, tt)
  mod <- lp(direction="min",
            objective.in=rep(1, ncol(k.tuples)),
            const.mat=t(apply(t.tuples, 2, function(x) {
              apply(k.tuples, 2, function(y) as.numeric(sum(x %in% y) == tt))
            })),
            const.dir=rep(">=", ncol(t.tuples)),
            const.rhs=rep(r, ncol(t.tuples)),
            all.int=TRUE)
  k.tuples[,rep(seq_len(ncol(k.tuples)), round(mod$solution))]
}
C(5, 3, 2, 1)
#      [,1] [,2] [,3] [,4]
# [1,]    1    1    1    3
# [2,]    2    2    2    4
# [3,]    3    4    5    5
C(5, 3, 2, 3)
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,]    1    1    1    1    1    1    2    2    2     3
# [2,]    2    2    2    3    3    4    3    3    4     4
# [3,]    3    4    5    4    5    5    4    5    5     5

While this solves your problem exactly, it will not scale well to large problem sizes. This is because the problem is NP-hard -- no known exact algorithm will scale well. If you need to solve large problem instances, then the heuristics recommended in other answers here are your best bet. Or you could solve with integer programming (as we do here) and set a timeout; then you will be working with the best solution found by your timeout, which is a heuristic solution to the problem overall.