Apply different functions to different sets of columns by group
As always with this sort of programmatic use of [.data.table
, the general strategy is to construct an expression e
that that can be evaluated in the j
argument. Once you understand that (as I'm sure you do), it just becomes a game of computing on the language to get a j
-slot expression that looks like what you'd write at the command line.
Here, for instance, and given the particular values in your example, you'd like a call that looks like:
test[, list(dat1=sum(dat1), dat2=sum(dat2), nv1=nv1[1], nv2=nv2[1]),
by=c("g1", "g2")]
so the expression you'd like evaluated in the j
-slot is
list(dat1=sum(dat1), dat2=sum(dat2), nv1=nv1[1], nv2=nv2[1])
Most of the following function is taken up with constructing just that expression:
f <- function(dt, bycols, datacols, nvcols) {
e <- c(sapply(datacols, function(x) call("sum", as.symbol(x))),
sapply(nvcols, function(x) call("[", as.symbol(x), 1)))
e<- as.call(c(as.symbol("list"), e))
dt[,eval(e), by=bycols]
}
f(test, bycols=bycols, datacols=datacols, nvcols=nonvaryingcols)
## g1 g2 dat1 dat2 nv1 nv2
## 1: a A 0.8403809 0.6713090 1 1
## 2: b A 0.4491883 0.4607716 2 2
## 3: c A 0.6083939 1.2031960 3 3
## 4: d A 1.5510033 1.2945761 4 4
## 5: e A 1.1302971 0.8573135 5 5
## ---
## 126: v Z 0.5627018 0.4282380 126 126
## 127: w Z 0.7588966 1.4429034 127 127
## 128: x Z 0.7060596 1.3736510 128 128
## 129: y Z 0.6015249 0.4488285 129 129
## 130: z Z 1.5304034 1.6012207 130 130
Here's what I had come up with. It works, but very slowly.
test[, {
cbind(
as.data.frame( t( sapply( .SD[, ..datacols], sum ) ) ),
.SD[, ..nonvaryingcols][1]
)
}, by = bycols ]
Benchmarks
FunJosh <- function() {
f(test, bycols=bycols, datacols=datacols, nvcols=nonvaryingcols)
}
FunAri <- function() {
test[, {
cbind(
as.data.frame( t( sapply( .SD[, ..datacols], sum ) ) ),
.SD[, ..nonvaryingcols][1]
)
}, by = bycols ]
}
FunEddi <- function() {
cbind(
test[, lapply(.SD, sum), by = bycols, .SDcols = datacols],
test[, lapply(.SD, "[", 1), by = bycols, .SDcols = nonvaryingcols][, ..nonvaryingcols]
)
}
library(microbenchmark)
identical(FunJosh(), FunAri())
# [1] TRUE
microbenchmark(FunJosh(), FunAri(), FunEddi())
#Unit: milliseconds
# expr min lq median uq max neval
# FunJosh() 2.749164 2.958478 3.098998 3.470937 6.863933 100
# FunAri() 246.082760 255.273839 284.485654 360.471469 509.740240 100
# FunEddi() 5.877494 6.229739 6.528205 7.375939 112.895573 100
At least two orders of magnitude slower than @joshobrien's solution. Edit @Eddi's solution is much faster as well, and shows that cbind wasn't optimal but could be fairly fast in the right hands. Might be all the transforming and sapply
ing I was doing rather than just directly using lapply
.
Just for a bit of variety, here is a variant of @Josh O'brien's solution that uses the bquote
operator instead of call
. I did try to replace the final as.call with a bquote, but because bquote doesn't support list splicing (e.g., see this question), I couldn't get that to work.
f <- function(dt, bycols, datacols, nvcols) {
datacols = sapply(datacols, as.symbol)
nvcols = sapply(nvcols, as.symbol)
e = c(lapply(datacols, function(x) bquote(sum(.(x)))),
lapply(nvcols, function(x) bquote(.(x)[1])))
e = as.call(c(as.symbol("list"), e))
dt[,eval(e), by=bycols]
}
> f(test, bycols=bycols, datacols=datacols, nvcols=nonvaryingcols)
g1 g2 dat1 dat2 nv1 nv2
1: a A 0.8404 0.6713 1 1
2: b A 0.4492 0.4608 2 2
3: c A 0.6084 1.2032 3 3
4: d A 1.5510 1.2946 4 4
5: e A 1.1303 0.8573 5 5
---
126: v Z 0.5627 0.4282 126 126
127: w Z 0.7589 1.4429 127 127
128: x Z 0.7061 1.3737 128 128
129: y Z 0.6015 0.4488 129 129
130: z Z 1.5304 1.6012 130 130
>