Generating Random Pairs of Integers without Replacement in R
First, I found how to generate the pairs on SO. However, that didn't scale, so I looked up ?combn and found the expand.grid function.
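For reference, a minimal sketch of what expand.grid does (the column names here are my own):

```r
## expand.grid returns a data frame with one row per
## combination of the supplied vectors
pairs <- expand.grid(first = 1:3, second = 1:2)
nrow(pairs) # 3 * 2 = 6 combinations
```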
Next, I use the data.table package because it handles large data well (see its documentation for the reasons why).
## the data.table library does well with large data sets
library(data.table)
## Small dummy dataset
pairOne = 1:10
pairTwo = 1:2
nSamples = 3
system.time({
  dt = data.table(expand.grid(pairOne, pairTwo))
  dt2 = dt[sample(1:dim(dt)[1], size = nSamples), ]
})
# user system elapsed
# 0.002 0.001 0.001
## Large dummy dataset
pairOne = 1:10000
pairTwo = 1:10000
length(pairOne) * length(pairTwo) # 1e8 possible pairs
nSamples = 1e5
system.time({
  dt = data.table(expand.grid(pairOne, pairTwo))
  dt2 = dt[sample(1:dim(dt)[1], size = nSamples), ]
})
# user system elapsed
# 2.576 1.276 3.862
Inspired by David Robinson's initial stab:
set.seed(1)
np <- 1000 # number of elements desired
M1 <- t(combn(1:np, 2))
sam <- sample(1:nrow(M1), np, replace = FALSE)
M2 <- M1[sam,]
anyDuplicated(M2) # returns 0, i.e. no duplicated rows
This would use all possible entries of M1
but in a random order. Is this what you wanted?
The key here is not to generate all the permutations, as that is very expensive in both memory and time. Since you only care about two numbers, we can do this very easily so long as (number_of_possible_values) ^ 2 is less than the largest integer exactly representable in double precision floating point:
size <- 1e5
samples <- 100
vals <- sample.int(size ^ 2, samples)
cbind(vals %/% size + 1, vals %% size)
Basically, we use integers to represent every possible combination of values. In our example, we sample from all the numbers up to 1e5 ^ 2, since we have 1e5 ^ 2 possible combinations of 1e5 numbers. Each of those 1e10 integers represents one of the combinations. We then decompose each sampled integer into its two component values: the integer division (plus one) gives the first number, and the modulo gives the second.
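To make the encoding concrete, here is a small sketch (my own illustration, with made-up values): with size = 10, the pair (6, 7) corresponds to the integer (6 - 1) * 10 + 7 = 57, and the decoding reverses that:

```r
size <- 10
val  <- 57                       # encodes the pair (6, 7)
c(val %/% size + 1, val %% size) # decoding used above: 6 7

## Exact multiples of `size` decode to a 0 in the second slot
## (e.g. 60 -> 7 0), so a fully 1-based decoding is safer:
c((val - 1) %/% size + 1, (val - 1) %% size + 1) # also 6 7
```

With only 100 samples drawn from 1e10 possible values, hitting such an exact multiple is unlikely, but it is worth knowing about.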
Benchmarks:
Unit: microseconds
expr min lq mean
funBrodie(10000, 100) 16.457 17.188 22.052
funRichard(10000, 100) 542513.717 640647.919 638045.215
Also, the limit should be ~3x1e7, and it remains relatively fast:
Unit: microseconds
expr min lq mean median uq max neval
funBrodie(1e+07, 100) 18.285 20.6625 22.88209 21.211 22.4905 77.893 100
Functions for benchmarking:
funRichard <- function(size, samples) {
  nums <- 1:size
  dt = CJ(nums, nums) # CJ() is data.table's fast cross-join
  dt[sample(1:dim(dt)[1], size = samples), ]
}
funBrodie <- function(size, samples) {
  vals <- sample.int(size ^ 2, samples)
  cbind(vals %/% size + 1, vals %% size)
}
And confirm we're doing similar things (note it's not a given these should be exactly the same, but it turns out they are):
set.seed(1)
resB <- funBrodie(1e4, 100)
set.seed(1)
resR <- unname(as.matrix(funRichard(1e4, 100)))
all.equal(resB, resR)
# TRUE