Conditionally selecting columns in dplyr where a certain proportion of values is NA
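All the answers below assume a data frame dta with an observation column and several value columns containing different proportions of NAs. Something along these lines (illustrative values and column mix, not the asker's actual data) reproduces the shape of the output shown in the answers:
library(dplyr)
set.seed(123)
dta <- data.frame(
  observation = 1:10,
  valueA = runif(10),                    # no NAs  -> kept
  valueB = replace(runif(10), 2:5, NA),  # 40% NA  -> kept
  valueC = replace(runif(10), 2:5, NA),  # 40% NA  -> kept
  valueD = rep(NA_real_, 10)             # 100% NA -> dropped
)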
Like this perhaps?
dta %>% select(which(colMeans(is.na(.)) < 0.5)) %>% head
#  observation    valueA    valueB    valueC
#1           1 0.2655087 0.9347052 0.8209463
#2           2 0.3721239        NA        NA
#3           3 0.5728534        NA        NA
#4           4 0.9082078        NA        NA
#5           5 0.2016819        NA        NA
#6           6 0.8983897 0.3861141        NA
Updated to use colMeans instead of colSums, which means you don't need to divide by the number of rows any more.
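To spell out the equivalence (a small illustration, not part of the original answer), both of these give the per-column proportion of missing values:
colSums(is.na(dta)) / nrow(dta)  # count of NAs divided by the row count
colMeans(is.na(dta))             # mean of a logical matrix gives the same proportion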
And, just for the record, in base R you could also use colMeans:
dta[, colMeans(is.na(dta)) < 0.5]
I think this does the job:
dta %>% select_if(~mean(is.na(.)) < 0.5) %>% head()
  observation    valueA    valueB    valueC
1           1 0.2655087 0.9347052 0.8209463
2           2 0.3721239        NA        NA
3           3 0.5728534        NA        NA
4           4 0.9082078        NA        NA
5           5 0.2016819        NA        NA
6           6 0.8983897 0.3861141        NA
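As a side note, select_if() has since been superseded in dplyr; with dplyr >= 1.0.0 the same condition can be written with the where() helper (a sketch of the equivalent, not the original answer):
dta %>% select(where(~ mean(is.na(.x)) < 0.5)) %>% head()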
We can use extract from magrittr after getting a logical vector with summarise_each/unlist.
library(magrittr)
library(dplyr)
dta %>%
  summarise_each(funs(sum(is.na(.)) < n()/2)) %>%
  unlist() %>%
  extract(dta, .)
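For what it's worth, extract() is just magrittr's alias for [, and summarise_each()/funs() have since been deprecated in dplyr; a rough modern equivalent of the same idea (a sketch using summarise()/across(), not the answer's original code) is:
library(dplyr)
keep <- dta %>%
  summarise(across(everything(), ~ mean(is.na(.x)) < 0.5)) %>%  # TRUE for columns to keep
  unlist()
dta[keep]  # same as extract(dta, keep)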
Or use Filter from base R:
Filter(function(x) sum(is.na(x)) < length(x)/2, dta)
Or a slightly more compact option:
Filter(function(x) mean(is.na(x)) < 0.5, dta)