How do I filter a range of numbers in R?
Piggybacking on @Anna 's answer, I just ran a few of the options to see which was faster on a larger dataset for a problem have at work. I used the setup from here (Faster way to subset on rows of a data frame in R?) I checked it on a 1 billion row (16gb) dataset. Looks like data.table edged out dplyr by a little bit. I am just starting to use data.table though so I may have not used the most efficient code. Oh, also, I narrowed it down to these 4 based on times from a 100million row dataset. See below:
set.seed(42)
# 1 billion rows
df <- data.frame(age=sample(1:65,1e9,replace=TRUE),x=rnorm(1e9),y=rpois(1e9,25))
microbenchmark(df1 <- df %>% filter(age >= 5 & age <= 25),
df2 <- df %>% filter(dplyr::between(df$age, 5, 25)),
times=10)
Unit: seconds
expr min lq mean median uq max neval
df %>% filter(age >= 5 & age <= 25) 15.327 15.796 16.526 16.601 17.086 17.996 10
df %>% filter(dplyr::between(df$age, 5, 25)) 14.214 14.752 15.413 15.487 16.121 16.447 10
DT <- as.data.table(df)
microbenchmark(dt1 <- DT[age %inrange% c(5, 25)],
dt2 <- DT[age %between% c(5, 25)],
times = 10)
Unit: seconds
expr min lq mean median uq max neval
dt1 <- DT[age %inrange% c(5, 25)] 15.122 16.042 17.180 16.969 17.310 22.138 10
dt2 <- DT[age %between% c(5, 25)] 10.212 11.121 11.675 11.436 12.132 13.913 10
Plenty of good dplyr solutions such as filtering in or hard-coding the upper and lower bounds already present in some of the answers:
MydataTable%>% filter(between(x, 3, 70))
Mydata %>% filter(x %in% 3:7)
Mydata %>% filter(x>=3&x<=7)
You could also work with data.table, which is very fast for large data sets. inrange
and between
work identically for this purpose
library(data.table)
MydataTable <- data.table(x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
y = c(20, 30, 45, 54, 65, 78, 97, 102, 123, 156))
MydataTable[x %inrange% c(3,7)]
MydataTable[x %between% c(3,7)]
A benefit of this method (besides the speed of data.table) is that you only need to specify the min and max range - you are not creating an array to subset the filter.
A time comparison of these methods:
> df <- data.frame(x = sample(1:10, 10000000, replace = T),
+ y = sample(1:10, 10000000, replace = T))
> system.time({ df %>% filter(between(x, 3, 7)) })
user system elapsed
0.18 0.05 0.14
> system.time({ df %>% filter(x %in% 3:7) })
user system elapsed
0.19 0.06 0.29
> system.time({ df %>% filter(x>=3&x<=7) })
user system elapsed
0.17 0.09 0.26
> dt <- data.table(df)
> system.time( {dt[x %inrange% c(3,7)] })
user system elapsed
0.13 0.07 0.21
> system.time( {dt[x %between% c(3,7)] })
user system elapsed
0.18 0.05 0.13
And the good old base::subset
:
subset(Mydata, x >= 3 & x <= 7)
subset(Mydata, x %in% 3:7)
You can use %in%
, or as has been mentioned, alternatively dplyr
s between()
:
library(dplyr)
new_frame <- Mydata %>% filter(x %in% (3:7) )
new_frame
# x y
# 1 3 45
# 2 4 54
# 3 5 65
# 4 6 78
# 5 7 97
While %in%
works great for integers (or other equally spaced sequences), if you need to filter on floats, or any value between and including your two end points, or just want an alternative that's a bit more explicit than %in%
, use dplyr
's between()
:
new_frame2 <- Mydata%>% filter( between(x, 3, 7) )
new_frame2
# x y
# 1 3 45
# 2 4 54
# 3 5 65
# 4 6 78
# 5 7 97
To further clarify, note that %in%
checks for the presence in a set of values:
3 %in% 3:7
# [1] TRUE
5 %in% 3:7
# [1] TRUE
5.0 %in% 3:7
# [1] TRUE
The above return TRUE
because 3:7
is shorthand for seq(3, 7)
which produces:
3:7
# [1] 3 4 5 6 7
seq(3, 7)
# [1] 3 4 5 6 7
As such, if you were to use %in%
to check for values not produced by :
, it will return FALSE
:
4.5 %in% 3:7
# [1] FALSE
4.15 %in% 3:7
# [1] FALSE
Whereas between
checks against the end points and all values in between:
between(3, 3, 7)
# [1] TRUE
between(7, 3, 7)
# [1] TRUE
between(5, 3, 7)
# [1] TRUE
between(5.0, 3, 7)
# [1] TRUE
between(4.5, 3, 7)
# [1] TRUE
between(4.15, 3, 7)
# [1] TRUE