R - Speeding up approximate date match. idata.frame?
Thanks to Matthew Dowle and his addition of the ability to roll backwards as well as forwards in data.table, it is now much simpler to perform this merge.
ST <- data.table(sampleticker)
SR <- data.table(samplereport)
setkey(ST,ticker,date)
SR[,mergerdate:=rdate]
setkey(SR,ticker,mergerdate)
merge<-SR[ST,roll=-Inf]
setnames(merge,"mergerdate","date")
# ticker date rdate
# 1: A 2005-01-25 2005-02-15
# 2: A 2005-03-30 2005-04-15
# 3: AA 2005-02-15 2005-03-01
# 4: AA 2005-04-21 2005-05-01
Here is a data.table
-based solution that's likely to work better than what you are currently using:
library(data.table)
ST <- data.table(sampleticker, key="ticker")
SR <- data.table(samplereport, key="ticker")
SR <- SR[with(SR, order(ticker, rdate)),] # rdates need to be in increasing order
SR[ST, list(date = date,
rdate = rdate[match(TRUE, (rdate > date))]), ]
ticker date rdate
[1,] A 2005-01-25 2005-02-15
[2,] A 2005-03-30 2005-04-15
[3,] AA 2005-02-15 2005-03-01
[4,] AA 2005-04-21 2005-05-01
Of course, it sounds like what you really want to do is to merge together two much wider data.frames. To demonstrate one way of accomplishing that, in the example below, I add some columns to both data.tables, and then show how you could merge the appropriate rows:
# Add some columns to both data.tables
ST$alpha <- letters[seq_len(nrow(ST))]
SR$n <- seq_len(nrow(SR))
SR$ALPHA <- LETTERS[seq_len(nrow(SR))]
# Perform a merge that includes the whole rows from samplereport
# corresponding to the selected rdate
RES <- SR[ST, cbind(date, .SD[match(TRUE,(rdate>date)),-1]), ]
# Merge res (containing the selected rows from samplereport) back together
# with sampleticker
keycols <- c("ticker", "date")
setkeyv(RES, keycols)
setkeyv(ST, keycols)
ST[RES]
# ticker date alpha rdate n ALPHA
# [1,] A 2005-01-25 a 2005-02-15 1 A
# [2,] A 2005-03-30 b 2005-04-15 3 C
# [3,] AA 2005-02-15 c 2005-03-01 4 D
# [4,] AA 2005-04-21 d 2005-05-01 6 F
Here's a solution that follows up on Matthew Dowle's observation that this is a natural place to apply data.table
's roll=TRUE
argument.
If you are to apply it, there is one wrinkle to be ironed out. roll=TRUE
is designed such that when an exact match isn't found for the last column of the key (here date), the value from the nearest previous date will be rolled forward. You, though, want the opposite (and even when there is an exact match, you still want the value from the next available date).
A first attempt might be to sort by "ticker"
, and the by "rdate"
in reverse order, merging with the resulting reordered SR
. That would work, except that data.table
doesn't want to let you sort in reverse order: keying by "rdate"
forces that column into ascending order. (data.table
needs to do that in order to implement the speedy matching and joining for which it was designed).
My solution below, is to create a new column -- "rnd"
, for "reverse numerical date" -- in both data.tables, whose values are formed by doing -as.numeric(date)
. This assigns a unique value to each date. Moreover, because the values have been multiplied by -1
, sorting them in ascending order has the effect of sorting dates in descending order.
(One other detail: because you don't want exact matches, and instead always want the next date after the current one, I've subtracted 1
from sampleticker's rnd
, which has the desired effect. To confirm that it's doing its job correctly, I slightly edited your example data to include one possible exact match ("2005-1-25"
) which should not be selected by the merge).
# Create sample data.tables
library(data.table)
ST <- data.table(ticker = c("A","A","AA","AA"),
date = as.Date(c("2005-1-25","2005-03-30","2005-02-15",
"2005-04-21"), format="%Y-%m-%d"),
alpha = letters[1:4])
SR <- data.table(ticker = c("A","A","A","AA","AA","AA"),
rdate = as.Date(c("2005-1-25","2005-03-15","2005-04-15",
"2005-03-01","2005-04-20","2005-05-01"),
format="%Y-%m-%d"),
ALPHA = LETTERS[1:6])
With sample data in hand, set up for and perform the desired merge:
# Create a "reverse numerical date" column, which will uniquely
# identify date, and allow them to be sorted in reverse temporal order
ST$rnd <- -(as.numeric(ST$date) + 1)
SR$rnd <- -(as.numeric(SR$rdate))
# key (and thus sort) both data.tables by ticker and "reverse numerical date"
keycols <- c("ticker", "rnd")
setkeyv(ST, keycols)
setkeyv(SR, keycols)
# The syntax of the merge is now as simple as can be
res <- SR[ST, roll=TRUE]
# Finally, put the results back in temporal order, and pretty up the column order
setkeyv(res, c("ticker", "date"))
setcolorder(res, c("ticker", "date", "rdate", "alpha", "ALPHA", "rnd"))
res
# ticker date rdate alpha ALPHA rnd
# [1,] A 2005-01-25 2005-03-15 a B -12809
# [2,] A 2005-03-30 2005-04-15 b C -12873
# [3,] AA 2005-02-15 2005-03-01 c D -12830
# [4,] AA 2005-04-21 2005-05-01 d F -12895