dplyr summarise() with multiple return values from a single function

In recent versions of the tidyverse (dplyr 1.0.0 or later, with tidyr 1.0.0 or later for unpack()), this is possible.

First, in the example you provided, the function returns a one-row data frame. If we use such a function in summarize(), it generates a data-frame column, which we can turn into separate columns via unpack().

library(tidyverse)
library(psych)

describe(diamonds$price)
#>    vars     n   mean      sd median trimmed     mad min   max range skew
#> X1    1 53940 3932.8 3989.44   2401 3158.99 2475.94 326 18823 18497 1.62
#>    kurtosis    se
#> X1     2.18 17.18

diamonds %>%
  group_by(cut) %>%
  summarize(descr = describe(price)) %>%
  unpack(cols = descr)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 5 x 14
#>   cut    vars     n  mean    sd median trimmed   mad   min   max range  skew
#>   <ord> <dbl> <dbl> <dbl> <dbl>  <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Fair      1  1610 4359. 3560.  3282    3696. 2183.   337 18574 18237  1.78
#> 2 Good      1  4906 3929. 3682.  3050.   3252. 2853.   327 18788 18461  1.72
#> 3 Very…     1 12082 3982. 3936.  2648    3243. 2855.   336 18818 18482  1.60
#> 4 Prem…     1 13791 4584. 4349.  3185    3822. 3371.   326 18823 18497  1.33
#> 5 Ideal     1 21551 3458. 3808.  1810    2656. 1631.   326 18806 18480  1.84
#> # … with 2 more variables: kurtosis <dbl>, se <dbl>
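As a minimal sketch of the same idea without psych, assuming dplyr >= 1.0.0: when the data-frame result is left unnamed inside summarise(), its columns are unpacked automatically, so no separate unpack() step is needed. The helper range_stats() is a hypothetical name used only for illustration, and mtcars stands in for diamonds.

```r
library(dplyr)

# Hypothetical helper (not from the original answer):
# returns a one-row tibble of summary statistics
range_stats <- function(x) {
  tibble(min = min(x), mean = mean(x), max = max(x))
}

# The unnamed data-frame result is spliced into separate columns
res <- mtcars %>%
  group_by(cyl) %>%
  summarise(range_stats(mpg), .groups = "drop")

names(res)
```

Naming the result (for example stats = range_stats(mpg)) would instead produce a single data-frame column, which is the case where unpack() is needed.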

Second, in some cases a function simply returns a vector. In those cases, summarize() generates one new row per returned value.

set.seed(1234)
dsmall <- diamonds[sample(nrow(diamonds), 25), ]

unique(dsmall$clarity)
#> [1] I1   SI2  VVS2 VS1  VVS1 VS2  SI1  IF  
#> Levels: I1 < SI2 < SI1 < VS2 < VS1 < VVS2 < VVS1 < IF

dsmall %>%
  group_by(cut) %>%
  summarize(clarity = unique(clarity))
#> `summarise()` regrouping output by 'cut' (override with `.groups` argument)
#> # A tibble: 17 x 2
#> # Groups:   cut [4]
#>    cut       clarity
#>    <ord>     <ord>  
#>  1 Good      I1     
#>  2 Good      SI2    
#>  3 Good      VS1    
#>  4 Good      SI1    
#>  5 Very Good VVS2   
#>  6 Very Good SI2    
#>  7 Very Good VS1    
#>  8 Very Good IF     
#>  9 Premium   SI2    
#> 10 Premium   SI1    
#> 11 Ideal     VS1    
#> 12 Ideal     VVS1   
#> 13 Ideal     VS2    
#> 14 Ideal     VVS2   
#> 15 Ideal     SI1    
#> 16 Ideal     SI2    
#> 17 Ideal     IF

Created on 2020-07-14 by the reprex package (v0.3.0)
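
Note that in dplyr >= 1.1.0, returning more than one row per group from summarise() is deprecated in favor of reframe(). A minimal sketch of the multi-row case, assuming dplyr >= 1.1.0 and using mtcars as a stand-in dataset:

```r
library(dplyr)

# reframe() allows any number of rows per group
res <- mtcars %>%
  group_by(cyl) %>%
  reframe(gear = sort(unique(gear)))

res
```

Unlike summarise(), reframe() always returns an ungrouped data frame, so there is no .groups argument to set.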

With dplyr >= 0.2 we can use the do() function for this:

library(ggplot2)
library(psych)
library(dplyr)
diamonds %>%
    group_by(cut) %>%
    do(describe(.$price)) %>%
    select(-vars)
#> Source: local data frame [5 x 13]
#> Groups: cut [5]
#> 
#>         cut     n     mean       sd median  trimmed      mad   min   max range     skew kurtosis       se
#>      (fctr) (dbl)    (dbl)    (dbl)  (dbl)    (dbl)    (dbl) (dbl) (dbl) (dbl)    (dbl)    (dbl)    (dbl)
#> 1      Fair  1610 4358.758 3560.387 3282.0 3695.648 2183.128   337 18574 18237 1.780213 3.067175 88.73281
#> 2      Good  4906 3928.864 3681.590 3050.5 3251.506 2853.264   327 18788 18461 1.721943 3.042550 52.56197
#> 3 Very Good 12082 3981.760 3935.862 2648.0 3243.217 2855.488   336 18818 18482 1.595341 2.235873 35.80721
#> 4   Premium 13791 4584.258 4349.205 3185.0 3822.231 3371.432   326 18823 18497 1.333358 1.072295 37.03497
#> 5     Ideal 21551 3457.542 3808.401 1810.0 2656.136 1630.860   326 18806 18480 1.835587 2.977425 25.94233
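
do() has been superseded since dplyr 1.0.0. A rough equivalent, assuming dplyr >= 0.8.1, is group_modify(), which applies a function to each group's data frame and expects a data frame back. The sketch below uses mtcars and hand-picked statistics as stand-ins for diamonds and describe().

```r
library(dplyr)

# .x is the data for one group, without the grouping column
res <- mtcars %>%
  group_by(cyl) %>%
  group_modify(~ data.frame(n = nrow(.x), mean_mpg = mean(.x$mpg)))

res
```

The grouping column is added back to the result automatically, and the output stays grouped by cyl.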

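Solution based on the purrrlyr package (slice_rows() and by_slice() were part of purrr until they moved to purrrlyr in 2017):

library(ggplot2)
library(psych)
library(purrrlyr)  # provides slice_rows() and by_slice()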
diamonds %>% 
    slice_rows("cut") %>% 
    by_slice(~ describe(.x$price), .collate = "rows")
#> Source: local data frame [5 x 14]
#> 
#>         cut  vars     n     mean       sd median  trimmed      mad   min   max range     skew kurtosis       se
#>      (fctr) (dbl) (dbl)    (dbl)    (dbl)  (dbl)    (dbl)    (dbl) (dbl) (dbl) (dbl)    (dbl)    (dbl)    (dbl)
#> 1      Fair     1  1610 4358.758 3560.387 3282.0 3695.648 2183.128   337 18574 18237 1.780213 3.067175 88.73281
#> 2      Good     1  4906 3928.864 3681.590 3050.5 3251.506 2853.264   327 18788 18461 1.721943 3.042550 52.56197
#> 3 Very Good     1 12082 3981.760 3935.862 2648.0 3243.217 2855.488   336 18818 18482 1.595341 2.235873 35.80721
#> 4   Premium     1 13791 4584.258 4349.205 3185.0 3822.231 3371.432   326 18823 18497 1.333358 1.072295 37.03497
#> 5     Ideal     1 21551 3457.542 3808.401 1810.0 2656.136 1630.860   326 18806 18480 1.835587 2.977425 25.94233
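
If purrrlyr is not available, a similar result can be sketched with plain purrr by splitting the column by group, mapping a summary function over the pieces, and binding the rows. This assumes purrr and dplyr are installed; stats() is a hypothetical helper and mtcars is a stand-in dataset.

```r
library(purrr)
library(dplyr)

# Hypothetical helper: one-row data frame of statistics for a numeric vector
stats <- function(x) data.frame(n = length(x), mean = mean(x))

res <- split(mtcars$mpg, mtcars$cyl) %>%  # named list, one vector per group
  map(stats) %>%                          # one-row data frame per group
  bind_rows(.id = "cyl")                  # list names become the cyl column

res
```

The group labels come back as a character column here, because they are taken from the names of the split list.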

But it is simpler with data.table:

library(data.table)
as.data.table(diamonds)[, describe(price), by = cut]
#>          cut vars     n     mean       sd median  trimmed      mad min   max range     skew kurtosis       se
#> 1:     Ideal    1 21551 3457.542 3808.401 1810.0 2656.136 1630.860 326 18806 18480 1.835587 2.977425 25.94233
#> 2:   Premium    1 13791 4584.258 4349.205 3185.0 3822.231 3371.432 326 18823 18497 1.333358 1.072295 37.03497
#> 3:      Good    1  4906 3928.864 3681.590 3050.5 3251.506 2853.264 327 18788 18461 1.721943 3.042550 52.56197
#> 4: Very Good    1 12082 3981.760 3935.862 2648.0 3243.217 2855.488 336 18818 18482 1.595341 2.235873 35.80721
#> 5:      Fair    1  1610 4358.758 3560.387 3282.0 3695.648 2183.128 337 18574 18237 1.780213 3.067175 88.73281

We can also write our own summary function that returns a list:

fun <- function(x) {
    list(n = length(x),
         min = min(x),
         median = as.numeric(median(x)),
         mean = mean(x),
         sd = sd(x),
         max = max(x))
}
as.data.table(diamonds)[, fun(price), by = cut]
#>          cut     n min median     mean       sd   max
#> 1:     Ideal 21551 326 1810.0 3457.542 3808.401 18806
#> 2:   Premium 13791 326 3185.0 4584.258 4349.205 18823
#> 3:      Good  4906 327 3050.5 3928.864 3681.590 18788
#> 4: Very Good 12082 336 2648.0 3981.760 3935.862 18818
#> 5:      Fair  1610 337 3282.0 4358.758 3560.387 18574
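
Because j only needs to evaluate to a list, the output of such a function can be combined with other summaries using c(). A minimal sketch, assuming the data.table package; fun2() is a hypothetical shortened variant of the function above and mtcars is a stand-in dataset.

```r
library(data.table)

# Hypothetical shortened variant of the list-returning summary function
fun2 <- function(x) list(n = length(x), mean = mean(x))

dt <- as.data.table(mtcars)

# c() concatenates the two lists; each element becomes one output column
res <- dt[, c(fun2(mpg), list(max_hp = max(hp))), by = cyl]

res
```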
