subsetting a data.table based on a named list

You can use the CJ (Cross Join) function from data.table to make a filtering table from the list.

lookup <- do.call(CJ, param)
head(lookup)
#    a b  c
# 1: 1 2  5
# 2: 1 2  7
# 3: 1 2 10
# 4: 1 3  5
# 5: 1 3  7
# 6: 1 3 10

DT[
    lookup,
    on = names(lookup),
    nomatch = NULL
]
#    a b c d
# 1: 1 3 5 1
# 2: 3 3 7 3

Note that nomatch = 0 means any combo in lookup that doesn't exist in DT won't return a row.

Using Map we can do

DT[DT[, all(Map(`%in%`, .SD, param)), by = 1:nrow(DT)]$V1]
#   a b c d
#1: 1 3 5 1
#2: 3 3 7 3

For each row we check if all elements in DT are present in param.

Thanks to @Frank, this can be improved to

DT[DT[, all(mapply(`%in%`, .SD, param)), by = 1:nrow(DT), .SDcols=names(param)]$V1]

You could build the expression with call(fun, ...) and as.name:

myp    = Filter(Negate(is.null), param)

exs = Map(function(var, val) call("%in%", var, val), var = sapply(names(myp), as.name), val = myp)
exi = Reduce(function(x,y) call("&", x, y), exs)
ex = call("[", x = as.name("DT"), i = exi)
# DT[i = a %in% 1:10 & b %in% 2:3 & c %in% c(5, 7, 10)]

eval(ex)
#    a b c d
# 1: 1 3 5 1
# 2: 3 3 7 3

By composing the call correctly, you can take advantage of efficient algorithms for "indices" in the data.table (see the package vignettes). You can also turn verbose on to get a note about the inefficiency of specifying param$c as numeric when DT$c is int:

> z <- as.call(c(as.list(ex), verbose=TRUE))
> eval(z)
Optimized subsetting with index 'c__b__a'
on= matches existing index, using index
Coercing double column i.'c' to integer to match type of x.'c'. Please avoid coercion for efficiency.
Starting bmerge ...done in 0.020sec 
   a b c d
1: 1 3 5 1
2: 3 3 7 3

That is, you should use c(5L, 7L, 10L).

A join, as in Nathan's answer, also uses indices, but building and joining on the Cartesian table of param will be costly if prod(lengths(param)) is large.

@markus approach may be slow due to by-row operation, so here is a variant:

DT[do.call(pmin, Map(`%in%`, DT[, names(param), with=FALSE], param)) == 1L]

#    a b c d
# 1: 1 3 5 1
# 2: 3 3 7 3

The trick is that the elementwise version of all is pmin(...) == 1L. Likewise, any corresponds to pmax(...) == 1L. (This is why pany/pall are not included in this conversation on r-devel: http://r.789695.n4.nabble.com/There-is-pmin-and-pmax-each-taking-na-rm-how-about-psum-td4647841.html)

subsetting a data.table based on a named list

Tags:

List

R

Subset

Data.Table

Related

Recent Posts