filter for complete cases in data.frame using dplyr (case-wise deletion)
Here are some benchmark results for Grothendieck's reply. na.omit() takes 20x as much time as the other two solutions. I think it would be nice if dplyr had a function for this maybe as part of filter.
library('rbenchmark')
library('dplyr')
n = 5e6
n.na = 100000
df = data.frame(
x1 = sample(1:10, n, replace=TRUE),
x2 = sample(1:10, n, replace=TRUE)
)
df$x1[sample(1:n, n.na)] = NA
df$x2[sample(1:n, n.na)] = NA
benchmark(
df %>% filter(complete.cases(x1,x2)),
df %>% na.omit(),
df %>% (function(x) filter(x, complete.cases(x)))()
, replications=50)
# test replications elapsed relative
# 3 df %.% (function(x) filter(x, complete.cases(x)))() 50 5.422 1.000
# 1 df %.% filter(complete.cases(x1, x2)) 50 6.262 1.155
# 2 df %.% na.omit() 50 109.618 20.217
This works for me:
df %>%
filter(complete.cases(df))
Or a little more general:
library(dplyr) # 0.4
df %>% filter(complete.cases(.))
This would have the advantage that the data could have been modified in the chain before passing it to the filter.
Another benchmark with more columns:
set.seed(123)
x <- sample(1e5,1e5*26, replace = TRUE)
x[sample(seq_along(x), 1e3)] <- NA
df <- as.data.frame(matrix(x, ncol = 26))
library(microbenchmark)
microbenchmark(
na.omit = {df %>% na.omit},
filter.anonymous = {df %>% (function(x) filter(x, complete.cases(x)))},
rowSums = {df %>% filter(rowSums(is.na(.)) == 0L)},
filter = {df %>% filter(complete.cases(.))},
times = 20L,
unit = "relative")
#Unit: relative
# expr min lq median uq max neval
# na.omit 12.252048 11.248707 11.327005 11.0623422 12.823233 20
#filter.anonymous 1.149305 1.022891 1.013779 0.9948659 4.668691 20
# rowSums 2.281002 2.377807 2.420615 2.3467519 5.223077 20
# filter 1.000000 1.000000 1.000000 1.0000000 1.000000 20
Try this:
df %>% na.omit
or this:
df %>% filter(complete.cases(.))
or this:
library(tidyr)
df %>% drop_na
If you want to filter based on one variable's missingness, use a conditional:
df %>% filter(!is.na(x1))
or
df %>% drop_na(x1)
Other answers indicate that of the solutions above na.omit
is much slower but that has to be balanced against the fact that it returns row indices of the omitted rows in the na.action
attribute whereas the other solutions above do not.
str(df %>% na.omit)
## 'data.frame': 2 obs. of 2 variables:
## $ x1: num 1 2
## $ x2: num 1 2
## - attr(*, "na.action")= 'omit' Named int 3 4
## ..- attr(*, "names")= chr "3" "4"
ADDED Have updated to reflect latest version of dplyr and comments.
ADDED Have updated to reflect latest version of tidyr and comments.