Select last row by group for all columns data.table
Last row by group :
DT[, .SD[.N], by="TRADER_ID,EXEC_IDATE"] # (1)
or, faster (avoid use of .SD
where possible, for speed) :
w = DT[, .I[.N], by="TRADER_ID,EXEC_IDATE"][[3]] # (2)
DT[w]
Note that the following feature request will make approach (1) as fast as approach (2) :
FR#2330 Optimize .SD[i] query to keep the elegance but make it faster unchanged.
How about something like this? (Synthetic data meant to mimic what I can infer about yours from the question)
tmp <- data.table(id = sample(1:20, 1e6, replace=TRUE),
date = as.Date(as.integer(runif(n=1e6, min = 1e4, max = 1.1e4)),
origin = as.Date("1970-01-01")),
data1 = rnorm(1e6),
data2 = rnorm(1e6),
data3 = rnorm(1e6))
> system.time(X <- tmp[, lapply(.SD, tail, 1), by = list(id, date)])
user system elapsed
1.95 0.00 1.95
> system.time(Y <- tmp[, list(tail(data1, 1)), by = list(id, date)])
user system elapsed
1.24 0.01 1.26
> system.time({
setkey(tmp, id, date)
Z <- tmp[unique(tmp)[, key(tmp), with=FALSE], mult="last"]
})
user system elapsed
0.90 0.02 0.92
X and Z are the same after same order is ensured:
> identical(setkey(X, id, date), setkey(Z, id, date))
[1] TRUE
The difference between my lapply
tail
and 1-column tail
isn't as drastic as yours, but without the structure of your data, it's hard to say more.
Also, note that most of the time in this method is setting the key. If the table is already sorted by the grouping columns, it goes really fast:
> system.time(Z <- tmp[unique(tmp)[, key(tmp), with=FALSE], mult="last"])
user system elapsed
0.03 0.00 0.03
Alternatively, you could translate the many column problem to the 1-column problem with a temporary column:
> system.time({
tmp[, row.num := seq_len(nrow(tmp))]
W <- tmp[tmp[, max(row.num), by = list(id, date)]$V1][, row.num := NULL]
tmp[, row.num := NULL]
})
user system elapsed
0.92 0.00 1.09
> identical(setkey(X, id, date), setkey(W, id, date))
[1] TRUE