How to subset dataframe based on a "not equal to" criteria applied to a large number of columns?

We can create a vector with the codes to be removed and use rowSums to remove, i.e.

codes_to_remove <- c("F023", "G20", "F009", "F002", "F001", "F000", "F00", "G309", "G308",
                "G301", "G300", "G30", "F01", "F018", "F013", "F012", "F011", "F010", "F01")

df[rowSums(sapply(df[-1], `%in%`, codes_to_remove)) == 0,]

which gives,

     ID disease_code_1 disease_code_2 disease_code_3
1  1001           I802           A071           H250
2  1002           H356             NA             NA
4  1004           D235             NA           I802
5  1005           B178             NA             NA
8  1008           C761             NA             NA
11 1011           J679           A045           D352

How about this:

> dementia <- c("F023", "G20", "F009", "F002", "F001", "F000", "F00", "G309", "G308",
+               "G301", "G300", "G30", "F01", "F018", "F013", "F012", "F011", "F010", "F01")
> 
> dementia <- apply(sapply(df[, -1], function(x) {x %in% dementia}), 1, any)
> 
> df[!dementia,]
     ID disease_code_1 disease_code_2 disease_code_3
1  1001           I802           A071           H250
2  1002           H356             NA             NA
4  1004           D235             NA           I802
5  1005           B178             NA             NA
8  1008           C761             NA             NA
11 1011           J679           A045           D352
>

Edit:

An even more elegant solution, thanks to @ Ronan Shah:

> df[apply(df[-1], 1, function(x) {!any(x %in% dementia)}),]
     ID disease_code_1 disease_code_2 disease_code_3
1  1001           I802           A071           H250
2  1002           H356             NA             NA
4  1004           D235             NA           I802
5  1005           B178             NA             NA
8  1008           C761             NA             NA
11 1011           J679           A045           D352

Hope it helps.

One dplyr possibility could be:

df %>%
 filter_at(vars(2:4), all_vars(! . %in% c("F023","G20","F009","F002","F001","F000","F00",    
            "G309", "G308","G301","G300","G30", "F01","F018","F013",
            "F012", "F011", "F010","F01")))

    ID disease_code_1 disease_code_2 disease_code_3
1 1001           I802           A071           H250
2 1002           H356             NA             NA
3 1004           D235             NA           I802
4 1005           B178             NA             NA
5 1008           C761             NA             NA
6 1011           J679           A045           D352

In this case, it checks whether any of the columns 2:4 contains any of the given codes.

Or:

df %>%
 filter_at(vars(contains("disease_code")), all_vars(! . %in% c("F023","G20","F009","F002","F001","F000","F00",    
            "G309", "G308","G301","G300","G30", "F01","F018","F013",
            "F012", "F011", "F010","F01")))

In this case, it checks whether any of the columns with names disease_code contains any of the given codes.

How to subset dataframe based on a "not equal to" criteria applied to a large number of columns?

Tags:

R

Filter

Subset

Dataframe

Related

Recent Posts