R: fast (conditional) subsetting where feasible
I agree with Konrad's answer that this should throw a warning or at least report what happens somehow. Here's a data.table way that will take advantage of indices (see package vignettes for details):
f = function(x, ..., verbose=FALSE){
  # capture the conditions unevaluated
  L = substitute(list(...))[-1]
  # log of which conditions were applied or skipped
  mon = data.table(cond = as.character(L))[, skip := FALSE]
  for (i in seq_along(L)){
    d = eval( substitute(x[cond, verbose=v], list(cond = L[[i]], v = verbose)) )
    if (nrow(d)){
      x = d
    } else {
      # condition would leave no rows, so skip it
      mon[i, skip := TRUE]
    }
  }
  print(mon)
  return(x)
}
Usage
> f(dat, x > 119, y > 219, y > 1e6)
        cond  skip
1:   x > 119 FALSE
2:   y > 219 FALSE
3: y > 1e+06  TRUE
   id        x        y        z
1: 55 119.2634 219.0044 315.6556
The verbose option will print extra info provided by the data.table package, so you can see when indices are being used. For example, with f(dat, x == 119, verbose=TRUE), I see it.
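To check index use outside of f, here is a minimal sketch on a made-up table (dt and its columns are assumptions, not the data from the question). It sets a secondary index with setindex and then runs an equality subset with verbose=TRUE; the exact wording of the verbose messages depends on the data.table version, so none is shown here.

```r
library(data.table)

# Made-up example data (not the 'dat' from the question)
dt = data.table(id = 1:5, x = c(3, 1, 4, 1, 5))

# Create a secondary index on x; the row order of dt is left unchanged
setindex(dt, x)

# List the indices currently set on dt
idx = indices(dt)
print(idx)

# Equality subset; verbose = TRUE makes data.table report what it is doing
hit = dt[x == 4, verbose = TRUE]
print(hit)
```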
From the question: "because I fear the if-then jungle would be rather slow, especially since I need to apply all of this to different data.tables within a list using lapply(.)."
If it's for non-interactive use, maybe better to have the function return list(mon = mon, x = x) to more easily keep track of what the query was and what happened. Also, the verbose console output could be captured and returned.
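A rough sketch of that non-interactive variant, applied over a list with lapply. The name f_log and the example tables are made up for illustration; the loop is the same as in f above, minus the verbose argument.

```r
library(data.table)

# Hypothetical variant of f(): same loop, but returns the log with the result
f_log = function(x, ...){
  L = substitute(list(...))[-1]
  mon = data.table(cond = as.character(L))[, skip := FALSE]
  for (i in seq_along(L)){
    d = eval( substitute(x[cond], list(cond = L[[i]])) )
    if (nrow(d)){
      x = d
    } else {
      mon[i, skip := TRUE]
    }
  }
  list(mon = mon, x = x)
}

# Made-up example data (not the data from the question)
dts = list(
  a = data.table(id = 1:3, x = c(100, 120, 130), y = c(200, 220, 210)),
  b = data.table(id = 1:2, x = c(101, 102), y = c(230, 240))
)

# Wrap the call in an anonymous function so the conditions are written out literally
res = lapply(dts, function(d) f_log(d, x > 119, y > 219))

res$a$x     # filtered rows for table "a"
res$b$mon   # which conditions were skipped for table "b"
```

Each element of res then carries both the filtered table and the record of skipped conditions, so nothing needs to be read off the console.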
An interesting approach could be built on a modified version of the filter function offered in dplyr. When no rows meet the condition, the non_empty_filter function returns the original data set.
Notes
- IMHO, this is fairly non-standard behaviour and should be reported via a warning. Of course, the warning can be removed; it has no bearing on the function's results.
Function
library(tidyverse)
library(rlang) # enquo
non_empty_filter <- function(df, expr) {
  # capture the condition unevaluated
  expr <- enquo(expr)
  res <- df %>% filter(!!expr)
  if (nrow(res) > 0) {
    return(res)
  } else {
    # Indicate that filter is not applied
    warning("No rows meeting condition")
    return(df)
  }
}
Condition met
Behaviour: Returning one row for which the condition is met.
dat %>%
  non_empty_filter(x > 119 & y > 219)
Results
# id x y z
# 1 55 119.2634 219.0044 315.6556
Condition not met
Behaviour: Returning the full data set as the whole condition is not met due to y > 1e6.
dat %>%
  non_empty_filter(x > 119 & y > 219 & y > 1e6)
Results
# id x y z
# 1: 1 109.3400 208.6732 308.7595
# 2: 2 101.6920 201.0989 310.1080
# 3: 3 119.4697 217.8550 313.9384
# 4: 4 111.4261 205.2945 317.3651
# 5: 5 100.4024 212.2826 305.1375
# 6: 6 114.4711 203.6988 319.4913
# 7: 7 112.1879 209.5716 319.6732
# 8: 8 106.1344 202.2453 312.9427
# 9: 9 101.2702 210.5923 309.2864
# 10: 10 106.1071 211.8266 301.0645
Condition met/not met one-by-one
Behaviour: Skipping filter that would return an empty data set.
dat %>%
  non_empty_filter(y > 1e6) %>%
  non_empty_filter(x > 119) %>%
  non_empty_filter(y > 219)
Results
# id x y z
# 1 55 119.2634 219.0044 315.6556
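The one-by-one chaining could also be folded into a single call. Below is a minimal sketch, assuming a hypothetical helper non_empty_filter_each (not part of dplyr) that takes several conditions through ... and skips, with a warning, any condition that would leave no rows. The toy data frame is made up for the example.

```r
library(dplyr)
library(rlang) # enquos, as_label

# Hypothetical helper: apply each condition in turn, skipping the empty ones
non_empty_filter_each <- function(df, ...) {
  conds <- enquos(...)
  for (cond in conds) {
    res <- filter(df, !!cond)
    if (nrow(res) > 0) {
      df <- res
    } else {
      # Report which condition was skipped
      warning("No rows meeting condition: ", as_label(cond))
    }
  }
  df
}

# Made-up example data (not the 'dat' from the question)
toy <- data.frame(id = 1:3, x = c(100, 120, 130), y = c(200, 220, 210))

# y > 1e6 matches nothing and is skipped; the other two conditions are applied
out <- suppressWarnings(non_empty_filter_each(toy, y > 1e6, x > 119, y > 219))
print(out)
```

As in the chained version, the result depends on the order of the conditions, since each one is tested against whatever the previous conditions left.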