R: fast (conditional) subsetting where feasible

I agree with Konrad's answer that this should throw a warning or at least report what happens somehow. Here's a data.table way that will take advantage of indices (see package vignettes for details):

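library(data.table)

f = function(x, ..., verbose=FALSE){
  # capture the conditions unevaluated and set up a log of which ones get skipped
  L   = substitute(list(...))[-1]
  mon = data.table(cond = as.character(L))[, skip := FALSE]

  for (i in seq_along(L)){
    # build and evaluate x[<i-th condition>, verbose=verbose]
    d = eval( substitute(x[cond, verbose=v], list(cond = L[[i]], v = verbose)) )
    if (nrow(d)){
      x = d                    # keep the filtered result
    } else {
      mon[i, skip := TRUE]     # filter would leave no rows, so skip it
    }
  }
  print(mon)
  return(x)
}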

Usage

> f(dat, x > 119, y > 219, y > 1e6)
        cond  skip
1:   x > 119 FALSE
2:   y > 219 FALSE
3: y > 1e+06  TRUE
   id        x        y        z
1: 55 119.2634 219.0044 315.6556

The verbose option prints extra information provided by the data.table package, so you can see when indices are being used. For example, with f(dat, x == 119, verbose=TRUE), the output shows the index being used.
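To check index use outside of f, here is a minimal sketch on made-up data (the small dat below is an assumption for illustration, not the data from the question). It sets an index on x explicitly with setindex and then subsets with verbose=TRUE; the exact wording of the verbose messages depends on the data.table version, so none is reproduced here.

```r
library(data.table)

# Made-up data for illustration only
dat = data.table(id = 1:5, x = c(101, 112, 119, 105, 118))

# Create a secondary index on x and confirm it is registered
setindex(dat, x)
indices(dat)

# Subset on the indexed column; verbose=TRUE prints what data.table does internally
res = dat[x == 119, verbose = TRUE]
res
```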

From the question: "because I fear the if-then jungle would be rather slow, especially since I need to apply all of this to different data.tables within a list using lapply(.)."

If this is for non-interactive use, it may be better to have the function return list(mon = mon, x = x), which makes it easier to keep track of what the query was and what happened. The verbose console output could also be captured and returned.
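A minimal sketch of the list-returning variant, applied over a list of tables with lapply. The name f2 and the toy data are assumptions made for illustration; the verbose argument is dropped to keep the sketch short.

```r
library(data.table)

# Hypothetical variant of f: returns the skip log together with the data
f2 = function(x, ...){
  L   = substitute(list(...))[-1]
  mon = data.table(cond = as.character(L))[, skip := FALSE]

  for (i in seq_along(L)){
    d = eval( substitute(x[cond], list(cond = L[[i]])) )
    if (nrow(d)){
      x = d
    } else {
      mon[i, skip := TRUE]
    }
  }
  list(mon = mon, x = x)
}

# Made-up data for illustration only
dat = data.table(id = 1:5,
                 x  = c(101, 112, 119.5, 105, 118),
                 y  = c(201, 215, 219.5, 208, 221))

res = f2(dat, x > 119, y > 219, y > 1e6)
res$mon   # which conditions were skipped
res$x     # the filtered table

# Applying the same conditions to several tables held in a list
tables = list(a = dat, b = dat[x < 110])
out = lapply(tables, function(d) f2(d, x > 119, y > 219))
```

Each element of out then carries its own log, so skipped conditions can be inspected per table, for example with out$b$mon.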


An interesting approach could be developed using a modified version of the filter function offered by dplyr. If no rows meet the condition, non_empty_filter returns the original data set.

Notes

  • IMHO, this is fairly non-standard behaviour and should be reported via a warning. Of course, the warning can be removed; it has no bearing on the function's results.

Function

library(tidyverse)
library(rlang) # enquo
non_empty_filter <- function(df, expr) {
    expr <- enquo(expr)

    res <- df %>% filter(!!expr)

    if (nrow(res) > 0) {
        return(res)
    } else {
        # Indicate that filter is not applied
        warning("No rows meeting condition")
        return(df)
    }
}

Condition met

Behaviour: Returning one row for which the condition is met.

dat %>%
    non_empty_filter(x > 119 & y > 219)

Results

# id        x        y        z
# 1 55 119.2634 219.0044 315.6556

Condition not met

Behaviour: Returning the full data set as the whole condition is not met due to y > 1e6.

dat %>%
    non_empty_filter(x > 119 & y > 219 & y > 1e6)

Results

# id        x        y        z
# 1:   1 109.3400 208.6732 308.7595
# 2:   2 101.6920 201.0989 310.1080
# 3:   3 119.4697 217.8550 313.9384
# 4:   4 111.4261 205.2945 317.3651
# 5:   5 100.4024 212.2826 305.1375
# 6:   6 114.4711 203.6988 319.4913
# 7:   7 112.1879 209.5716 319.6732
# 8:   8 106.1344 202.2453 312.9427
# 9:   9 101.2702 210.5923 309.2864
# 10:  10 106.1071 211.8266 301.0645

Condition met/not met one-by-one

Behaviour: Skipping filter that would return an empty data set.

dat %>%
    non_empty_filter(y > 1e6) %>% 
    non_empty_filter(x > 119) %>% 
    non_empty_filter(y > 219)

Results

# id        x        y        z
# 1 55 119.2634 219.0044 315.6556
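
The one-by-one chaining could also be wrapped in a single call that takes several conditions. The following is only a sketch: non_empty_filter_all is a hypothetical helper name, and the small dat is made-up data rather than the data set used above. It captures the conditions with enquos and applies them in order, skipping (with a warning) any condition that would leave no rows.

```r
library(dplyr)
library(rlang) # enquos, as_label

# Hypothetical helper: apply each condition in turn, skipping empty results
non_empty_filter_all <- function(df, ...) {
    conds <- enquos(...)
    for (i in seq_along(conds)) {
        res <- df %>% filter(!!conds[[i]])
        if (nrow(res) > 0) {
            df <- res
        } else {
            # Indicate which filter is not applied
            warning("No rows meeting condition: ", as_label(conds[[i]]))
        }
    }
    df
}

# Made-up data for illustration only
dat <- data.frame(id = 1:5,
                  x  = c(101, 112, 119.5, 105, 118),
                  y  = c(201, 215, 219.5, 208, 221))

res <- non_empty_filter_all(dat, y > 1e6, x > 119, y > 219)
```

As with the chained version, the order of the conditions matters: a condition is tested against whatever rows survived the earlier ones.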