R: using data.table := operations to calculate new columns
Another possible approach using later versions of data.table
:
library(data.table) #data.table_1.12.6 as of Nov 20, 2019
cols <- copy(names(DT))
DT[, c("MIN_DATE", "MAX_DATE") := .(DATE - 180L, DATE)]
DT[, PRIORAVG :=
.SD[.SD, on=.(TICKER, PERIOD, DATE>=MIN_DATE, DATE<=MAX_DATE),
by=.EACHI, {
subdat <- .SD[x.ID!=i.ID]
pavg <- if (subdat[, .N > 0L])
mean(subdat[, last(VALUE), ID]$V1, na.rm=TRUE)
else
NA_real_
c(setNames(mget(paste0("i.", cols)), cols), .(PRIORAVG=pavg))
}]$PRIORAVG
]
DT[, PREV := shift(VALUE), .(TICKER, PERIOD, ID)]
output:
TICKER PERIOD DATE ID VALUE MIN_DATE MAX_DATE PRIORAVG PREV
1: ABC 2010-12-31 2010-01-05 1 1.5 2009-07-09 2010-01-05 NA NA
2: ABC 2010-12-31 2010-01-08 1 1.4 2009-07-12 2010-01-08 1.30 1.5
3: ABC 2010-12-31 2010-01-10 1 1.4 2009-07-14 2010-01-10 1.45 1.4
4: ABC 2010-12-31 2010-01-13 1 1.5 2009-07-17 2010-01-13 1.40 1.4
5: ABC 2010-12-31 2010-04-01 1 1.7 2009-10-03 2010-04-01 1.40 1.5
6: ABC 2010-12-31 2010-01-07 2 1.3 2009-07-11 2010-01-07 1.50 NA
7: ABC 2010-12-31 2010-01-11 2 1.2 2009-07-15 2010-01-11 1.50 1.3
8: ABC 2010-12-31 2010-04-02 2 1.8 2009-10-04 2010-04-02 1.65 1.2
9: ABC 2010-12-31 2010-08-03 2 1.7 2010-02-04 2010-08-03 1.70 1.8
10: ABC 2010-12-31 2010-01-09 3 1.6 2009-07-13 2010-01-09 1.35 NA
11: DEF 2011-12-31 2011-02-05 1 2.3 2010-08-09 2011-02-05 NA NA
Great question. Try this :
dt
TICKER PERIOD DATE ID VALUE
[1,] ABC 2010-12-31 2010-01-05 1 1.5
[2,] ABC 2010-12-31 2010-01-08 1 1.4
[3,] ABC 2010-12-31 2010-01-10 1 1.4
[4,] ABC 2010-12-31 2010-01-13 1 1.5
[5,] ABC 2010-12-31 2010-01-07 2 1.3
[6,] ABC 2010-12-31 2010-01-11 2 1.2
[7,] ABC 2010-12-31 2010-01-09 3 1.6
[8,] DEF 2011-12-31 2011-02-05 1 2.3
ids = unique(dt$ID)
dt[,PRIORAVG:=NA_real_]
for (i in 1:nrow(dt))
dt[i,PRIORAVG:=dt[J(TICKER[i],PERIOD[i],setdiff(ids,ID[i]),DATE[i]),
mean(VALUE,na.rm=TRUE),roll=TRUE,mult="last"]]
dt
TICKER PERIOD DATE ID VALUE PRIORAVG
[1,] ABC 2010-12-31 2010-01-05 1 1.5 NA
[2,] ABC 2010-12-31 2010-01-08 1 1.4 1.30
[3,] ABC 2010-12-31 2010-01-10 1 1.4 1.45
[4,] ABC 2010-12-31 2010-01-13 1 1.5 1.40
[5,] ABC 2010-12-31 2010-01-07 2 1.3 1.50
[6,] ABC 2010-12-31 2010-01-11 2 1.2 1.50
[7,] ABC 2010-12-31 2010-01-09 3 1.6 1.35
[8,] DEF 2011-12-31 2011-02-05 1 2.3 NA
Then what you had already with a slight simplification ...
dt[,PREV:=dt[J(TICKER,PERIOD,ID,DATE-1),VALUE,roll=TRUE,mult="last"]]
TICKER PERIOD DATE ID VALUE PRIORAVG PREV
[1,] ABC 2010-12-31 2010-01-05 1 1.5 NA NA
[2,] ABC 2010-12-31 2010-01-08 1 1.4 1.30 1.5
[3,] ABC 2010-12-31 2010-01-10 1 1.4 1.45 1.4
[4,] ABC 2010-12-31 2010-01-13 1 1.5 1.40 1.4
[5,] ABC 2010-12-31 2010-01-07 2 1.3 1.50 NA
[6,] ABC 2010-12-31 2010-01-11 2 1.2 1.50 1.3
[7,] ABC 2010-12-31 2010-01-09 3 1.6 1.35 NA
[8,] DEF 2011-12-31 2011-02-05 1 2.3 NA NA
If this is ok as a prototype then a large speed improvement would be to keep the loop but use set()
instead of :=
, to reduce overhead :
for (i in 1:nrow(dt))
set(dt,i,6L,dt[J(TICKER[i],PERIOD[i],setdiff(ids,ID[i]),DATE[i]),
mean(VALUE,na.rm=TRUE),roll=TRUE,mult="last"])
dt
TICKER PERIOD DATE ID VALUE PRIORAVG PREV
[1,] ABC 2010-12-31 2010-01-05 1 1.5 NA NA
[2,] ABC 2010-12-31 2010-01-08 1 1.4 1.30 1.5
[3,] ABC 2010-12-31 2010-01-10 1 1.4 1.45 1.4
[4,] ABC 2010-12-31 2010-01-13 1 1.5 1.40 1.4
[5,] ABC 2010-12-31 2010-01-07 2 1.3 1.50 NA
[6,] ABC 2010-12-31 2010-01-11 2 1.2 1.50 1.3
[7,] ABC 2010-12-31 2010-01-09 3 1.6 1.35 NA
[8,] DEF 2011-12-31 2011-02-05 1 2.3 NA NA
That should be a lot faster than the repeated vector scans shown in the question.
Or, the operation could be vectorized. But that would be less easy to write and read due to the features of this task.
Btw, there isn't any data in the question that would test the 180 day requirement. If you add some and show expected output again then I'll add the calculation of age using join inherited scope I mentioned in comments.