subset rows with (1) ALL and (2) ANY columns larger than a specific value
One option is looping row-by-row (e.g. with apply
) and using any
or all
, as proposed in the other two answers. However, this can be inefficient for large data frames.
A vectorized approach would be to use rowSums
to determine the number of values in each row matching your criterion, and filter based on that.
(1) When filtering to rows where ALL values are at least 10, this is the same as filtering to cases where the number of values in a row less than or equal to 10 is 0:
df[rowSums(df[,-1] <= 10) == 0,]
# id v1 v2
# 2 2 15 32
# 4 4 12 17
(2) Similarly, rowSums
can easily be used to compute the rows with ANY (at least one) value is larger than 10:
df[rowSums(df[,-1] > 10) > 0,]
# id v1 v2
# 2 2 15 32
# 4 4 12 17
# 5 5 7 11
The speedup is clear with a larger input:
set.seed(144)
df <- matrix(sample(c(1, 10, 20), 3e6, replace=TRUE), ncol=3)
system.time(df[apply(df[, -1], MARGIN = 1, function(x) all(x > 10)), ])
# user system elapsed
# 1.754 0.156 2.102
system.time(df[rowSums(df[,-1] <= 10) == 0,])
# user system elapsed
# 0.04 0.01 0.05
The dplyr
equivalent is as follows
library(dplyr)
#ANY
df %>% rowwise() %>%
filter(any(across(starts_with("v"), ~ sum((. > 10)))))
# A tibble: 3 x 3
# Rowwise:
id v1 v2
<int> <dbl> <dbl>
1 2 15 32
2 4 12 17
3 5 7 11
#ALL
df %>% rowwise() %>%
filter(all(across(starts_with("v"), ~ sum((. > 10)))))
# A tibble: 2 x 3
# Rowwise:
id v1 v2
<int> <dbl> <dbl>
1 2 15 32
2 4 12 17
This can be done using apply
with margin 1, which will apply a function to each row. The function to check a given row would be
function(row) {all(row > 10)}
So the way to extract the rows themselves is
df[apply(df, 1, function(row) {all(row > 10)}),]
See functions all()
and any()
for the first and second parts of your questions respectively. The apply()
function can be used to run functions over rows or columns. (MARGIN = 1
is rows, MARGIN = 2
is columns, etc). Note I use apply()
on df[, -1]
to ignore the id
variable when doing the comparisons.
Part 1:
> df <- data.frame(id=c(1:5), v1=c(0,15,9,12,7), v2=c(9,32,6,17,11))
> df[apply(df[, -1], MARGIN = 1, function(x) all(x > 10)), ]
id v1 v2
2 2 15 32
4 4 12 17
Part 2:
> df[apply(df[, -1], MARGIN = 1, function(x) any(x > 10)), ]
id v1 v2
2 2 15 32
4 4 12 17
5 5 7 11
To see what is going on, x > 10
returns a logical vector for each row (via apply()
indicating whether each element is greater than 10. all()
returns TRUE
if all element of the input vector are TRUE
and FALSE
otherwise. any()
returns TRUE
if any of the elements in the input is TRUE
and FALSE
if all are FALSE
.
I then use the logical vector resulting from the apply()
call
> apply(df[, -1], MARGIN = 1, function(x) all(x > 10))
[1] FALSE TRUE FALSE TRUE FALSE
> apply(df[, -1], MARGIN = 1, function(x) any(x > 10))
[1] FALSE TRUE FALSE TRUE TRUE
to subset df
(as shown above).