Excluding columns from a dataframe based on column sums
I've just had the same problem. Here's a solution using Tidyverse, in case you have both numeric and non numeric columns :
library(tidyverse)
set.seed(123)
dat <- data.frame(var1 = runif(10), var2 = rnorm(10), var3 = rlnorm(10), var4 = "notNumeric", var5 =0, var6 = FALSE )
dat %>%
select_if(negate(function(col) is.numeric(col) && sum(col) < 15))
What about a simple subset? First, we create a simple data frameL
R> dd = data.frame(x = runif(5), y = 20*runif(5), z=20*runif(5))
Then select the columns where the sum is greater than 15
R> dd1 = dd[,colSums(dd) > 15]
R> ncol(dd1)
[1] 2
In your data set, you only want to subset columns 6 onwards, so something like:
##Drop the first five columns
dd[,colSums(dd[,6:ncol(dd)]) > 15]
or
#Keep the first six columns
cols_to_drop = c(rep(TRUE, 5), dd[,6:ncol(dd)]>15)
dd[,cols_to_drop]
should work.
The key part to note is that in the square brackets, we want a vector of logicals, i.e. a vector of TRUE and FALSE. So if you wanted to subset using something a bit more complicated, then create a function that returns TRUE or FALSE and subset as usual.
You should be able to directly index your data.frame with a boolean and colSums()
. For example:
set.seed(123)
dat <- data.frame(var1 = runif(10), var2 = rnorm(10), var3 = rlnorm(10))
colSums(dat)
#-----
var1 var2 var3
5.782475 1.317914 12.91987
#Let's set the threshold at 5, so we should get var1 and var3
> dat[, colSums(dat) > 5]
#-----
var1 var3
1 0.2875775 5.9709924
2 0.7883051 1.6451811
3 0.4089769 0.1399294
...
EDIT - to address non-numeric columns
set.seed(123)
dat <- data.frame(var1 = runif(10), var2 = rnorm(10), var3 = rlnorm(10), var4 = "notNumeric")
require(plyr)
dat[, -which(numcolwise(sum)(dat) < 5)]
Consolec:/documents and settings/charles/desktop/
> dat[, -which(numcolwise(sum)(dat) < 5)]
var1 var3 var4
1 0.2875775 5.9709924 notNumeric
2 0.7883051 1.6451811 notNumeric
3 0.4089769 0.1399294 notNumeric
.....