How to subset a data frame based on a "not equal to" criterion applied to a large number of columns?
We can create a vector of the codes to be removed and use rowSums to drop every row that contains any of them, i.e.
codes_to_remove <- c("F023", "G20", "F009", "F002", "F001", "F000", "F00", "G309", "G308",
"G301", "G300", "G30", "F01", "F018", "F013", "F012", "F011", "F010", "F01")
df[rowSums(sapply(df[-1], `%in%`, codes_to_remove)) == 0,]
which gives,
ID disease_code_1 disease_code_2 disease_code_3
1 1001 I802 A071 H250
2 1002 H356 NA NA
4 1004 D235 NA I802
5 1005 B178 NA NA
8 1008 C761 NA NA
11 1011 J679 A045 D352
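To make the intermediate steps of the rowSums approach visible, here is a minimal sketch on a made-up data frame. The names toy, hits and n_hits and the data values are assumptions for illustration only, not the asker's data.

```r
# Hypothetical toy data (not the data frame from the question)
toy <- data.frame(
  ID = c(1001, 1002, 1003, 1004),
  disease_code_1 = c("I802", "F023", "H356", "D235"),
  disease_code_2 = c("A071", NA, "G20", NA),
  stringsAsFactors = FALSE
)
codes_to_remove <- c("F023", "G20")

# One logical column per disease column: TRUE where the cell holds a listed code.
# NA cells give FALSE, because NA %in% x is FALSE when x contains no NA.
hits <- sapply(toy[-1], `%in%`, codes_to_remove)

# Number of matching cells in each row
n_hits <- rowSums(hits)

# Keep only the rows without any match
kept <- toy[n_hits == 0, ]
kept$ID
```

Rows whose disease columns are all NA are kept, since a missing value never matches a code.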
How about this:
> dementia <- c("F023", "G20", "F009", "F002", "F001", "F000", "F00", "G309", "G308",
+ "G301", "G300", "G30", "F01", "F018", "F013", "F012", "F011", "F010", "F01")
>
> # Store the row flags under a new name so the code vector `dementia` is not overwritten
> has_dementia <- apply(sapply(df[, -1], function(x) {x %in% dementia}), 1, any)
>
> df[!has_dementia,]
ID disease_code_1 disease_code_2 disease_code_3
1 1001 I802 A071 H250
2 1002 H356 NA NA
4 1004 D235 NA I802
5 1005 B178 NA NA
8 1008 C761 NA NA
11 1011 J679 A045 D352
>
Edit:
An even more elegant solution, thanks to @Ronan Shah:
> df[apply(df[-1], 1, function(x) {!any(x %in% dementia)}),]
ID disease_code_1 disease_code_2 disease_code_3
1 1001 I802 A071 H250
2 1002 H356 NA NA
4 1004 D235 NA I802
5 1005 B178 NA NA
8 1008 C761 NA NA
11 1011 J679 A045 D352
Hope it helps.
One dplyr possibility could be:
library(dplyr)

# Keep rows where none of columns 2:4 holds one of the listed codes
df %>%
  filter_at(vars(2:4), all_vars(! . %in% c("F023","G20","F009","F002","F001","F000","F00",
                                           "G309", "G308","G301","G300","G30", "F01","F018","F013",
                                           "F012", "F011", "F010","F01")))
ID disease_code_1 disease_code_2 disease_code_3
1 1001 I802 A071 H250
2 1002 H356 NA NA
3 1004 D235 NA I802
4 1005 B178 NA NA
5 1008 C761 NA NA
6 1011 J679 A045 D352
In this case, all_vars() keeps a row only if none of the columns 2:4 contains any of the given codes.
Or:
df %>%
filter_at(vars(contains("disease_code")), all_vars(! . %in% c("F023","G20","F009","F002","F001","F000","F00",
"G309", "G308","G301","G300","G30", "F01","F018","F013",
"F012", "F011", "F010","F01")))
In this case, the same check is applied to every column whose name contains disease_code, so a row is kept only if none of those columns contains any of the given codes.
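In recent dplyr releases (1.0.4 and later), filter_at() is superseded, and the same filter can probably be written with if_any(). The sketch below assumes such a version is installed; toy, codes and result are hypothetical names with made-up values.

```r
library(dplyr)

# Hypothetical toy data (not the data frame from the question)
toy <- data.frame(
  ID = c(1, 2, 3),
  disease_code_1 = c("I802", "F023", "H356"),
  disease_code_2 = c("A071", NA, "G20"),
  stringsAsFactors = FALSE
)
codes <- c("F023", "G20")

# if_any() is TRUE for a row when at least one selected column matches,
# so negating it keeps the rows without any listed code
result <- toy %>%
  filter(!if_any(contains("disease_code"), ~ .x %in% codes))

result$ID
```

As with filter_at(), the columns are selected by name, so the call does not need to change if more disease_code columns are added.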