Conditionally selecting columns in dplyr where a certain proportion of values is NA
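All the answers below assume a data frame dta with an observation column and several value columns containing different proportions of NAs. Something along these lines (illustrative values and column mix, not the asker's actual data) reproduces the shape of the output shown in the answers:
library(dplyr)
set.seed(123)
dta <- data.frame(
  observation = 1:10,
  valueA = runif(10),                    # no NAs  -> kept
  valueB = replace(runif(10), 2:5, NA),  # 40% NA  -> kept
  valueC = replace(runif(10), 2:5, NA),  # 40% NA  -> kept
  valueD = rep(NA_real_, 10)             # 100% NA -> dropped
)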
Like this perhaps?
dta %>% select(which(colMeans(is.na(.)) < 0.5)) %>% head
#  observation    valueA    valueB    valueC
#1           1 0.2655087 0.9347052 0.8209463
#2           2 0.3721239        NA        NA
#3           3 0.5728534        NA        NA
#4           4 0.9082078        NA        NA
#5           5 0.2016819        NA        NA
#6           6 0.8983897 0.3861141        NA
Updated to use colMeans instead of colSums, which means you don't need to divide by the number of rows any more.
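To spell out the equivalence (a small illustration, not part of the original answer), both of these give the per-column proportion of missing values:
colSums(is.na(dta)) / nrow(dta)  # count of NAs divided by the row count
colMeans(is.na(dta))             # mean of a logical matrix gives the same proportion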
And, just for the record, in base R you could also use colMeans:
dta[, colMeans(is.na(dta)) < 0.5]
I think this does the job:
dta %>% select_if(~mean(is.na(.)) < 0.5) %>% head()
  observation    valueA    valueB    valueC
1           1 0.2655087 0.9347052 0.8209463
2           2 0.3721239        NA        NA
3           3 0.5728534        NA        NA
4           4 0.9082078        NA        NA
5           5 0.2016819        NA        NA
6           6 0.8983897 0.3861141        NA
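As a side note, select_if() has since been superseded in dplyr; with dplyr >= 1.0.0 the same condition can be written with the where() helper (a sketch of the equivalent, not the original answer):
dta %>% select(where(~ mean(is.na(.x)) < 0.5)) %>% head()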
We can use extract from magrittr after getting a logical vector with summarise_each/unlist.
library(magrittr)
library(dplyr)
dta %>%
  summarise_each(funs(sum(is.na(.)) < n()/2)) %>%
  unlist() %>%
  extract(dta, .)
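For what it's worth, extract() is just magrittr's alias for [, and summarise_each()/funs() have since been deprecated in dplyr; a rough modern equivalent of the same idea (a sketch using summarise()/across(), not the answer's original code) is:
library(dplyr)
keep <- dta %>%
  summarise(across(everything(), ~ mean(is.na(.x)) < 0.5)) %>%  # TRUE for columns to keep
  unlist()
dta[keep]  # same as extract(dta, keep)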
Or use Filter from base R:
Filter(function(x) sum(is.na(x)) < length(x)/2, dta)
Or a slightly more compact option:
Filter(function(x) mean(is.na(x)) < 0.5, dta)