Collapsing rows where some are all NA, others are disjoint with some NAs
Since dplyr 1.0.0
, you can also do (using the data provided by @Khashaa):
df %>%
group_by(ID) %>%
summarize(across(everything(), ~ first(na.omit(.))))
ID Col1 Col2 Col3 Col4
<int> <int> <int> <int> <int>
1 1 5 10 15 20
2 2 NA 30 35 40
Here's a couple of aggregate attempts:
aggregate(. ~ ID, data=dat, FUN=na.omit, na.action="na.pass")
# ID Col1 Col2 Col3 Col4
#1 1 5 10 15 20
#2 2 25 30 35 40
Since aggregate
's formula interface by default uses na.omit
on the entire data before doing any grouping, it will delete every row of dat
as they all contain at least one NA
value. Try it: nrow(na.omit(dat))
returns 0
. So in this case, use na.pass
in aggregate
and then na.omit
to skip over the NA
s that were passed through.
Alternatively, don't use the formula interface and specify the columns to aggregate manually:
aggregate(dat[-1], dat[1], FUN=na.omit )
aggregate(dat[c("Col1","Col2","Col3","Col4")], dat["ID"], FUN=na.omit)
# ID Col1 Col2 Col3 Col4
#1 1 5 10 15 20
#2 2 25 30 35 40
Try
library(dplyr)
DF %>% group_by(ID) %>% summarise_each(funs(sum(., na.rm = TRUE)))
Edit: To account for the case in which one column has all NAs
for a certain ID
, we need sum_NA()
function which returns NA
if all are NAs
txt <- "ID Col1 Col2 Col3 Col4
1 NA NA NA NA
1 5 10 NA NA
1 NA NA 15 20
2 NA NA NA NA
2 NA 30 NA NA
2 NA NA 35 40"
DF <- read.table(text = txt, header = TRUE)
# original code
DF %>%
group_by(ID) %>%
summarise_each(funs(sum(., na.rm = TRUE)))
# `summarise_each()` is deprecated.
# Use `summarise_all()`, `summarise_at()` or `summarise_if()` instead.
# To map `funs` over all variables, use `summarise_all()`
# A tibble: 2 x 5
ID Col1 Col2 Col3 Col4
<int> <int> <int> <int> <int>
1 1 5 10 15 20
2 2 0 30 35 40
sum_NA <- function(x) {if (all(is.na(x))) x[NA_integer_] else sum(x, na.rm = TRUE)}
DF %>%
group_by(ID) %>%
summarise_all(funs(sum_NA))
DF %>%
group_by(ID) %>%
summarise_if(is.numeric, funs(sum_NA))
# A tibble: 2 x 5
ID Col1 Col2 Col3 Col4
<int> <int> <int> <int> <int>
1 1 5 10 15 20
2 2 NA 30 35 40
Here's a data table approach that uses na.omit()
across the columns, grouped by ID.
library(data.table)
setDT(df)[, lapply(.SD, na.omit), by = ID]
# ID Col1 Col2 Col3 Col4
# 1: 1 5 10 15 20
# 2: 2 25 30 35 40