Using dplyr summarize with different operations for multiple columns
As other people have mentioned, this is normally done by calling summarize_each
/ summarize_at
/ summarize_if
for every group of columns that you want to apply the summarizing function to. As far as I know, you would have to create a custom function that performs summarizations to each subset. You can for example set the colnames in such way that you can use the select helpers (e.g. contains()
) to filter just the columns that you want to apply the function to. If not, then you can set the specific column numbers that you want to summarize.
For the example you mentioned, you could try the following:
summarizer <- function(tb, colsone, colstwo, colsthree,
funsone, funstwo, funsthree, group_name) {
return(bind_cols(
summarize_all(select(tb, colsone), .funs = funsone),
summarize_all(select(tb, colstwo), .funs = funstwo) %>%
ungroup() %>% select(-matches(group_name)),
summarize_all(select(tb, colsthree), .funs = funsthree) %>%
ungroup() %>% select(-matches(group_name))
))
}
#With colnames
iris %>% as.tibble() %>%
group_by(Species) %>%
summarizer(colsone = contains("Sepal"),
colstwo = matches("Petal.Length"),
colsthree = c(-contains("Sepal"), -matches("Petal.Length")),
funsone = "sum",
funstwo = "mean",
funsthree = "first",
group_name = "Species")
#With indexes
iris %>% as.tibble() %>%
group_by(Species) %>%
summarizer(colsone = 1:2,
colstwo = 3,
colsthree = 4,
funsone = "sum",
funstwo = "mean",
funsthree = "first",
group_name = "Species")
Fortunately, there is a much simpler way available now.
With the new dplyr 1.0.0 coming out soon, you can leverage the across
function for this purpose.
All you need to type is:
iris %>%
group_by(Species) %>%
summarize(
# I want the sum over the first two columns,
across(c(1,2), sum),
# the mean over the third
across(3, mean),
# the first value for all remaining columns (after a group_by(Species))
across(-c(1:3), first)
)
Great, isn't it?
I first thought the across is not necessary as the scoped variants worked just fine, but this use case is exactly why the across
function can be very beneficial.
You can get the latest version of dplyr by devtools::install_github("tidyverse/dplyr")