subset rows with (1) ALL and (2) ANY columns larger than a specific value

One option is looping row-by-row (e.g. with apply) and using any or all, as proposed in the other two answers. However, this can be inefficient for large data frames.

A vectorized approach would be to use rowSums to determine the number of values in each row matching your criterion, and filter based on that.

(1) When filtering to rows where ALL values are at least 10, this is the same as filtering to cases where the number of values in a row less than or equal to 10 is 0:

df[rowSums(df[,-1] <= 10) == 0,]
#   id v1 v2
# 2  2 15 32
# 4  4 12 17

(2) Similarly, rowSums can easily be used to compute the rows with ANY (at least one) value is larger than 10:

df[rowSums(df[,-1] > 10) > 0,]
#   id v1 v2
# 2  2 15 32
# 4  4 12 17
# 5  5  7 11

The speedup is clear with a larger input:

set.seed(144)
df <- matrix(sample(c(1, 10, 20), 3e6, replace=TRUE), ncol=3)
system.time(df[apply(df[, -1], MARGIN = 1, function(x) all(x > 10)), ])
#    user  system elapsed 
#   1.754   0.156   2.102 
system.time(df[rowSums(df[,-1] <= 10) == 0,])
#    user  system elapsed 
#    0.04    0.01    0.05 

The dplyr equivalent is as follows

library(dplyr)

#ANY
df %>% rowwise() %>%
  filter(any(across(starts_with("v"), ~ sum((. > 10)))))
# A tibble: 3 x 3
# Rowwise: 
     id    v1    v2
  <int> <dbl> <dbl>
1     2    15    32
2     4    12    17
3     5     7    11


#ALL
df %>% rowwise() %>%
  filter(all(across(starts_with("v"), ~ sum((. > 10)))))

# A tibble: 2 x 3
# Rowwise: 
     id    v1    v2
  <int> <dbl> <dbl>
1     2    15    32
2     4    12    17


This can be done using apply with margin 1, which will apply a function to each row. The function to check a given row would be

function(row) {all(row > 10)}

So the way to extract the rows themselves is

df[apply(df, 1, function(row) {all(row > 10)}),]

See functions all() and any() for the first and second parts of your questions respectively. The apply() function can be used to run functions over rows or columns. (MARGIN = 1 is rows, MARGIN = 2 is columns, etc). Note I use apply() on df[, -1] to ignore the id variable when doing the comparisons.

Part 1:

> df <- data.frame(id=c(1:5), v1=c(0,15,9,12,7), v2=c(9,32,6,17,11))
> df[apply(df[, -1], MARGIN = 1, function(x) all(x > 10)), ]
  id v1 v2
2  2 15 32
4  4 12 17

Part 2:

> df[apply(df[, -1], MARGIN = 1, function(x) any(x > 10)), ]
  id v1 v2
2  2 15 32
4  4 12 17
5  5  7 11

To see what is going on, x > 10 returns a logical vector for each row (via apply() indicating whether each element is greater than 10. all() returns TRUE if all element of the input vector are TRUE and FALSE otherwise. any() returns TRUE if any of the elements in the input is TRUE and FALSE if all are FALSE.

I then use the logical vector resulting from the apply() call

> apply(df[, -1], MARGIN = 1, function(x) all(x > 10))
[1] FALSE  TRUE FALSE  TRUE FALSE
> apply(df[, -1], MARGIN = 1, function(x) any(x > 10))
[1] FALSE  TRUE FALSE  TRUE  TRUE

to subset df (as shown above).

Tags:

R

Subset

R Faq