subsetting a data.table based on a named list
You can use the CJ
(Cross Join) function from data.table
to make a filtering table from the list.
lookup <- do.call(CJ, param)
head(lookup)
# a b c
# 1: 1 2 5
# 2: 1 2 7
# 3: 1 2 10
# 4: 1 3 5
# 5: 1 3 7
# 6: 1 3 10
DT[
lookup,
on = names(lookup),
nomatch = NULL
]
# a b c d
# 1: 1 3 5 1
# 2: 3 3 7 3
Note that nomatch = 0
means any combo in lookup
that doesn't exist in DT
won't return a row.
Using Map
we can do
DT[DT[, all(Map(`%in%`, .SD, param)), by = 1:nrow(DT)]$V1]
# a b c d
#1: 1 3 5 1
#2: 3 3 7 3
For each row we check if all elements in DT
are present in param
.
Thanks to @Frank, this can be improved to
DT[DT[, all(mapply(`%in%`, .SD, param)), by = 1:nrow(DT), .SDcols=names(param)]$V1]
You could build the expression with call(fun, ...)
and as.name
:
myp = Filter(Negate(is.null), param)
exs = Map(function(var, val) call("%in%", var, val), var = sapply(names(myp), as.name), val = myp)
exi = Reduce(function(x,y) call("&", x, y), exs)
ex = call("[", x = as.name("DT"), i = exi)
# DT[i = a %in% 1:10 & b %in% 2:3 & c %in% c(5, 7, 10)]
eval(ex)
# a b c d
# 1: 1 3 5 1
# 2: 3 3 7 3
By composing the call correctly, you can take advantage of efficient algorithms for "indices" in the data.table (see the package vignettes). You can also turn verbose on to get a note about the inefficiency of specifying param$c
as numeric when DT$c
is int:
> z <- as.call(c(as.list(ex), verbose=TRUE))
> eval(z)
Optimized subsetting with index 'c__b__a'
on= matches existing index, using index
Coercing double column i.'c' to integer to match type of x.'c'. Please avoid coercion for efficiency.
Starting bmerge ...done in 0.020sec
a b c d
1: 1 3 5 1
2: 3 3 7 3
That is, you should use c(5L, 7L, 10L)
.
A join, as in Nathan's answer, also uses indices, but building and joining on the Cartesian table of param
will be costly if prod(lengths(param))
is large.
@markus approach may be slow due to by-row operation, so here is a variant:
DT[do.call(pmin, Map(`%in%`, DT[, names(param), with=FALSE], param)) == 1L]
# a b c d
# 1: 1 3 5 1
# 2: 3 3 7 3
The trick is that the elementwise version of all
is pmin(...) == 1L
. Likewise, any
corresponds to pmax(...) == 1L
. (This is why pany
/pall
are not included in this conversation on r-devel: http://r.789695.n4.nabble.com/There-is-pmin-and-pmax-each-taking-na-rm-how-about-psum-td4647841.html)