Select last row by group for all columns data.table

Last row by group :

DT[, .SD[.N], by="TRADER_ID,EXEC_IDATE"]            # (1)

or, faster (avoid use of .SD where possible, for speed) :

w = DT[, .I[.N], by="TRADER_ID,EXEC_IDATE"][[3]]    # (2)
DT[w]

Note that the following feature request will make approach (1) as fast as approach (2) :

FR#2330 Optimize .SD[i] query to keep the elegance but make it faster unchanged.


How about something like this? (Synthetic data meant to mimic what I can infer about yours from the question)

tmp <- data.table(id = sample(1:20, 1e6, replace=TRUE),
                  date = as.Date(as.integer(runif(n=1e6, min = 1e4, max = 1.1e4)),
                                 origin = as.Date("1970-01-01")),
                  data1 = rnorm(1e6),
                  data2 = rnorm(1e6),
                  data3 = rnorm(1e6))

> system.time(X <- tmp[, lapply(.SD, tail, 1), by = list(id, date)])
  user  system elapsed 
  1.95    0.00    1.95

> system.time(Y <- tmp[, list(tail(data1, 1)), by = list(id, date)])
  user  system elapsed 
  1.24    0.01    1.26 

> system.time({
    setkey(tmp, id, date)
    Z <- tmp[unique(tmp)[, key(tmp), with=FALSE], mult="last"]
})
  user  system elapsed 
  0.90    0.02    0.92 

X and Z are the same after same order is ensured:

> identical(setkey(X, id, date), setkey(Z, id, date))
[1] TRUE

The difference between my lapply tail and 1-column tail isn't as drastic as yours, but without the structure of your data, it's hard to say more.

Also, note that most of the time in this method is setting the key. If the table is already sorted by the grouping columns, it goes really fast:

> system.time(Z <- tmp[unique(tmp)[, key(tmp), with=FALSE], mult="last"])
  user  system elapsed 
  0.03    0.00    0.03 

Alternatively, you could translate the many column problem to the 1-column problem with a temporary column:

> system.time({
  tmp[, row.num := seq_len(nrow(tmp))]
  W <- tmp[tmp[, max(row.num), by = list(id, date)]$V1][, row.num := NULL]
  tmp[, row.num := NULL]
})
user  system elapsed 
0.92    0.00    1.09 

> identical(setkey(X, id, date), setkey(W, id, date))
[1] TRUE

Tags:

R

Data.Table