Fast calculation of CDF / rolling join on multiple columns
Just a note on an alternative, yet obvious, solution:
library(data.table)

set.seed(2016)
dt <- data.table(x=rnorm(20000), y=rnorm(20000), z=rnorm(20000))

system.time({
    dt <- t(as.matrix(dt))
    bounds <- as.matrix(expand.grid(z=seq(-2,2,0.1),
                                    y=seq(-2,2,0.1),
                                    x=seq(-2,2,0.1)))
    bounds <- bounds[,ncol(bounds):1]  # reorder columns to x, y, z to match the rows of the transposed dt
    n_d <- ncol(bounds)
    x <- apply(bounds,
               1,
               function(x) sum(colSums(dt < x) == n_d))
})
This solution on my machine takes roughly twice as long to compute as the JoeBase and OldBenDT solutions posted. The main difference? Memory usage: this approach is processor-bound rather than memory-bound.
I don't know a precise way to compare memory usage in R, but memory.size(max=T) reported about 5 GB of memory used for those previous approaches (not for the non-equi-join approach), whilst only about 40 MB for the apply approach (note: I used 20,000 points in the sample distribution).
I think this has important implications for the scale of the computation you can execute.
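As an aside, memory.size() is Windows-only, so a rough, more portable way to gauge peak memory use (a sketch of how one might size this up, not how the comparison above was measured) is to reset the garbage collector's statistics before the computation and read the "max used" column afterwards:

gc(reset = TRUE)   # reset the "max used" memory statistics
x <- apply(bounds, 1, function(x) sum(colSums(dt < x) == n_d))
gc()               # the "max used" column now reports the peak since the reset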
Update
Below, I have provided a general base R solution (it will work on non-uniform grids). It was faster than the fastest published solution provided by the OP (more on this later). As the OP intimates, generating the N.cum column is the real bottleneck, so I focused my efforts on this task alone (generating the CDF is trivial once N.cum is obtained).
JoeBase <- function(dtt, s) {
    # s is a list of three ascending bound vectors: s[[1]] for x, s[[2]] for y, s[[3]] for z
    m <- matrix(c(dtt$x, dtt$y, dtt$z), ncol = 3)
    N.Cum <- array(vector(mode = "integer"), dim = rev(sapply(s, length)))
    for (i in seq_along(s[[1]])) {
        t1 <- m[,1] <= s[[1]][i]
        for (j in seq_along(s[[2]])) {
            t2 <- t1 & (m[,2] <= s[[2]][j])
            for (k in seq_along(s[[3]])) {
                N.Cum[k,j,i] <- sum(t2 & (m[,3] <= s[[3]][k]))
            }
        }
    }
    as.vector(N.Cum)
}
The above algorithm takes advantage of vectorized operations, notably the creation and reuse of the logical vectors t1 and t2. These vectors are used to obtain the number of rows that meet the criteria for all 3 columns in the original data.table; we simply rely on R's internal coercion of a logical vector to an integer vector by the action of sum.
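As a quick illustration of that coercion:

sum(c(TRUE, FALSE, TRUE, TRUE))  # each TRUE counts as 1L
[1] 3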
Figuring out how to populate the 3-dimensional integer array N.Cum was a bit of a challenge, as it is later converted to a vector via as.vector. It took a bit of trial and error to learn how as.vector behaves. To my surprise, the "last" and "first" dimensions must be permuted in order for the coercion to a vector to take place faithfully (the first few times through, I had N.Cum[i,j,k] instead of N.Cum[k,j,i]).
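A small illustration of that behaviour: as.vector unrolls an array in column-major order, with the first index varying fastest, which is why the z-index k has to occupy the first dimension of N.Cum so that the flattened vector lines up with CJ(X, Y, Z), where Z varies fastest.

a <- array(1:8, dim = c(2, 2, 2))
a[2, 1, 1]  # the first index varies fastest in the flattened result
[1] 2
as.vector(a)
[1] 1 2 3 4 5 6 7 8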
First, let's test equality:
library(data.table)

## Here is the function I used to test against. I included the generation
## of "bounds" and "bg" as "result" depends on both of these (N.B. "JoeBase" does not)
BenDT <- function(dt, s) {
    X <- data.table(X=s[[1]]); X[, x := X]
    Y <- data.table(Y=s[[2]]); Y[, y := Y]
    Z <- data.table(Z=s[[3]]); Z[, z := Z]

    dt <- X[dt, on="x", roll=-Inf, nomatch=0]
    dt <- Y[dt, on="y", roll=-Inf, nomatch=0]
    dt <- Z[dt, on="z", roll=-Inf, nomatch=0]

    bg <- dt[, .N, keyby=list(X, Y, Z)]

    bounds <- CJ(X=X$X, Y=Y$Y, Z=Z$Z)

    kl <- bg[bounds, on=c("X", "Y", "Z")]
    kl[is.na(N), N := 0]

    # Counting
    kl[, CountUntil.XgivenYZ := cumsum(N), by=list(Y, Z)]
    kl[, CountUntil.XYgivenZ := cumsum(CountUntil.XgivenYZ), by=list(X, Z)]
    kl[, CountUntil.XYZ := cumsum(CountUntil.XYgivenZ), by=list(X, Y)]

    # Cleanup
    setnames(kl, "CountUntil.XYZ", "N.cum")
    kl[, CDF := N.cum/nrow(dt)]

    kl
}
st <- list(seq(-2, 2, 0.1), seq(-2, 2, 0.1), seq(-2, 2, 0.1))
t1 <- BenDT(dt, st)
t2 <- JoeBase(dt, st)

all.equal(t1$N.cum, t2)
[1] TRUE
Now we test for speed. First, we compile the functions using cmpfun from the compiler package. The first benchmark reflects the efficiency on smaller examples.
library(compiler)
library(microbenchmark)

c.JoeBase <- cmpfun(JoeBase)
c.BenDT <- cmpfun(BenDT)
c.OldBenDT <- cmpfun(OldBenDT)  ## The previous best solution that Ben contributed

st <- list(seq(-2, 2, 0.1), seq(-2, 2, 0.1), seq(-2, 2, 0.1))
microbenchmark(c.BenDT(dt, st), c.OldBenDT(dt, st), c.JoeBase(dt, st), times = 10)
Unit: milliseconds
expr min lq mean median uq max neval cld
c.BenDT(dt, st) 34.24872 34.78908 38.87775 37.4924 43.37179 46.12859 10 a
c.OldBenDT(dt, st) 1485.68178 1532.35878 1607.96669 1593.9813 1619.58908 1845.75876 10 b
c.JoeBase(dt, st) 1880.71648 1962.38160 2049.43985 2007.4880 2169.93078 2281.02118 10 c
Below is the old test.
However, when the number of bins increases, c.JoeBase really starts to dominate (over 5x faster than c.OldBenDT by the median).
st <- list(seq(-5, 5, 0.1), seq(-5, 5, 0.1), seq(-5, 5, 0.1))
microbenchmark(c.JoeBase(dt, st), c.OldBenDT(dt, st), times = 5)
Unit: seconds
expr min lq mean median uq max neval cld
c.JoeBase(dt, st) 23.50927 23.53809 29.61145 24.52748 30.81485 45.66759 5 a
c.OldBenDT(dt, st) 110.60209 123.95285 133.74601 124.97929 125.96186 183.23394 5 b
After performing further tests, I have some misgivings about the results (@Ben has expressed a similar sentiment in the comments). I'm pretty sure c.JoeBase appears faster only because of the limitations of my old computer. As @stephematician pointed out in his answer, the original solution is memory intensive, and if you simply run system.time on c.OldBenDT, you will see that the majority of the time is spent in the system category, while the user category is comparable to the user category of c.JoeBase. My 6-year-old Mac has only 4 GB of RAM, and I surmise that a lot of memory swapping is taking place with these operations. Observe:
## test with very tiny buckets (i.e. 0.025 instead of 0.1 above)
st <- list(seq(-1.5, 1.5, 0.025), seq(-1.5, 1.5, 0.025), seq(-1.5, 1.5, 0.025))
system.time(c.JoeBase(dt, st))
user system elapsed
36.407 4.748 41.170
system.time(c.OldBenDT(dt, st))
user system elapsed
49.653 77.954 475.304
system.time(c.BenDT(dt, st)) ## Ben's new solution is lightning fast
user system elapsed
0.603 0.063 0.668
Regardless, @Ben's latest solution is far superior. Check out these new benchmarks:
st <- list(seq(-5, 5, 0.1), seq(-5, 5, 0.1), seq(-5, 5, 0.1))
microbenchmark(c.JoeBase(dt, st), BenDT(dt, st), times = 5)
Unit: milliseconds
expr min lq mean median uq max neval cld
c.JoeBase(dt, st) 26517.0944 26855.7819 28341.5356 28403.7871 29926.213 30004.8018 5 b
BenDT(dt, st) 342.4433 359.8048 400.3914 379.5319 423.336 496.8411 5 a
Yet another victory for data.table.
Figured it out.
# Step1 - map each sample to the nearest X, Y, and Z at or above it. (In other words, bin the data.)
X <- data.table(X=seq(-2, 2, by=.1)); X[, x := X]
Y <- data.table(Y=seq(-2, 2, by=.1)); Y[, y := Y]
Z <- data.table(Z=seq(-2, 2, by=.1)); Z[, z := Z]

dt <- X[dt, on="x", roll=-Inf, nomatch=0]
dt <- Y[dt, on="y", roll=-Inf, nomatch=0]
dt <- Z[dt, on="z", roll=-Inf, nomatch=0]
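# Aside (an illustration, not part of the original pipeline): roll=-Inf rolls each sample
# value "backward" to the next bound at or above it, e.g. 0.43 is binned under the 0.5 bound:
X[data.table(x = 0.43), on="x", roll=-Inf]   # prints one row with X = 0.5, x = 0.43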

# Step2 - aggregate by unique (X, Y, Z) triplets and count the samples directly below each of these bounds
bg <- dt[, .N, keyby=list(X, Y, Z)]

# Step3 - get the count of samples directly below EVERY (X, Y, Z) bound
bounds <- CJ(X=X$X, Y=Y$Y, Z=Z$Z)
kl <- bg[bounds, on=c("X", "Y", "Z")]
kl[is.na(N), N := 0]

# Step4 (the tricky part) - Consider a single (Y, Z) pair. X will be in ascending order, so we can
# do a cumsum on X for each (Y, Z) to count x <= X | Y, Z. Now hold X and Z fixed: a cumsum on Y
# (which is also in ascending order) counts x <= X, y <= Y | Z. Then continue this process for Z.
kl[, CountUntil.XgivenYZ := cumsum(N), by=list(Y, Z)]
kl[, CountUntil.XYgivenZ := cumsum(CountUntil.XgivenYZ), by=list(X, Z)]
kl[, CountUntil.XYZ := cumsum(CountUntil.XYgivenZ), by=list(X, Y)]

# Cleanup
setnames(kl, "CountUntil.XYZ", "N.cum")
kl[, CDF := N.cum/nrow(dt)]
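To make Step4 more concrete, here is a tiny 2-D sketch with made-up per-bin counts (hypothetical numbers, not derived from dt): N is the number of samples binned exactly at each (X, Y) cell; the first cumsum (over X within each Y) counts samples with x <= X that are binned exactly at Y, and the second cumsum (over Y within each X) yields the full count of samples with x <= X and y <= Y. It prints roughly:

kl2 <- CJ(X = 1:2, Y = 1:2)                                 # sorted by X, then Y
kl2[, N := c(3L, 1L, 2L, 4L)]                               # made-up per-cell counts
kl2[, CountUntil.XgivenY := cumsum(N), by = Y]              # x <= X, binned exactly at Y
kl2[, CountUntil.XY := cumsum(CountUntil.XgivenY), by = X]  # x <= X and y <= Y
kl2
   X Y N CountUntil.XgivenY CountUntil.XY
1: 1 1 3                  3             3
2: 1 2 1                  1             4
3: 2 1 2                  5             5
4: 2 2 4                  5            10

Row 4 covers all 10 made-up samples, as expected, and row 2 (X = 1, Y = 2) is the 3 + 1 samples from the two cells with X = 1.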
Generalization
For anyone who wants it, I generalized this to work with any number of variables and dumped the function into my R package, mltools.
For example, to solve this problem you can do
library(mltools)
bounds <- list(x=seq(-2, 2, by=.1), y=seq(-2, 2, by=.1), z=seq(-2, 2, by=.1))
empirical_cdf(x=dt, ubounds=bounds)
x y z N.cum CDF
1: -2 -2 -2.0 0 0.000
2: -2 -2 -1.9 0 0.000
3: -2 -2 -1.8 0 0.000
4: -2 -2 -1.7 0 0.000
5: -2 -2 -1.6 0 0.000
---
68917: 2 2 1.6 899 0.899
68918: 2 2 1.7 909 0.909
68919: 2 2 1.8 917 0.917
68920: 2 2 1.9 924 0.924
68921: 2 2 2.0 929 0.929