Calculating length of 95%-CI using dplyr
Update tidyr 1.0.0
All the solutions given by @Valentin are viable but I wanted to hint to a new alternative which is more readable for some of you. It replaces all the summarise
solutions by a relatively new [tidyr 1.0.0][1] function called unnest_wider
.
With that, you can simplify the code to the following:
mtcars %>%
nest(data = -"vs") %>%
mutate(ci = map(data, ~ MeanCI(.x$mpg, method = "boot", R = 1000))) %>%
unnest_wider(ci)
which gives:
# A tibble: 2 x 5
vs data mean lwr.ci upr.ci
<dbl> <list> <dbl> <dbl> <dbl>
1 0 <tibble [18 × 10]> 16.6 14.7 18.5
2 1 <tibble [14 × 10]> 24.6 22.0 27.1
Calculating confidence intervals without bootstrapping is even simpler with:
mtcars %>%
nest(data = -"vs") %>%
mutate(ci = map(data, ~ MeanCI(.x$mpg))) %>%
unnest_wider(ci)
You could do it manually using mutate
a few extra functions in summarise
library(dplyr)
mtcars %>%
group_by(vs) %>%
summarise(mean.mpg = mean(mpg, na.rm = TRUE),
sd.mpg = sd(mpg, na.rm = TRUE),
n.mpg = n()) %>%
mutate(se.mpg = sd.mpg / sqrt(n.mpg),
lower.ci.mpg = mean.mpg - qt(1 - (0.05 / 2), n.mpg - 1) * se.mpg,
upper.ci.mpg = mean.mpg + qt(1 - (0.05 / 2), n.mpg - 1) * se.mpg)
#> Source: local data frame [2 x 7]
#>
#> vs mean.mpg sd.mpg n.mpg se.mpg lower.ci.mpg upper.ci.mpg
#> (dbl) (dbl) (dbl) (int) (dbl) (dbl) (dbl)
#> 1 0 16.61667 3.860699 18 0.9099756 14.69679 18.53655
#> 2 1 24.55714 5.378978 14 1.4375924 21.45141 27.66287
If you want to use the versatility of the boot
package, I found this blog post useful (the code below is inspired from there)
library(dplyr)
library(tidyr)
library(purrr)
library(boot)
set.seed(321)
mtcars %>%
group_by(vs) %>%
nest() %>%
mutate(boot_res = map(data,
~ boot(data = .$mpg,
statistic = function(x, i) mean(x[i]),
R = 1000)),
boot_res_ci = map(boot_res, boot.ci, type = "perc"),
mean = map(boot_res_ci, ~ .$t0),
lower_ci = map(boot_res_ci, ~ .$percent[[4]]),
upper_ci = map(boot_res_ci, ~ .$percent[[5]]),
n = map(data, nrow)) %>%
select(-data, -boot_res, -boot_res_ci) %>%
unnest(cols = c(n, mean, lower_ci, upper_ci)) %>%
ungroup()
#> # A tibble: 2 x 5
#> vs mean lower_ci upper_ci n
#> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 0 16.6 15.0 18.3 18
#> 2 1 24.6 22.1 27.3 14
Created on 2020-01-22 by the reprex package (v0.3.0)
Some explanation of the code:
When nesting with nest()
, a list column (called by default data
) is created, which contains 2 data frames, being the 2 subsets of the entire mtcars
grouped by vs
(which contains 2 unique values, 0 and 1).
Then, using mutate()
and map()
, we create the list column boot_res
by applying the function boot()
from the boot
package to the list column data
. Then boot_res_ci
list column is created by applying the boot.ci()
function to boot_res
list column and so on.
With select()
, we drop the list columns that are not needed anymore, fallowed by unnesting and ungrouping the final results.
The code is unfortunately not easy to navigate but it serves the purpose of another example.
Using broom::tidy()
Just realized that the package broom
has an implementation of a method to deal with boot()
outputs as pointed out here. This makes the code a bit less verbose and the output even a bit more complete, including the bias and the standard error of the statistic (the mean here):
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
library(boot)
set.seed(321)
mtcars %>%
group_by(vs) %>%
nest() %>%
mutate(boot_res = map(data,
~ boot(data = .$mpg,
statistic = function(x, i) mean(x[i]),
R = 1000)),
boot_tidy = map(boot_res, tidy, conf.int = TRUE, conf.method = "perc"),
n = map(data, nrow)) %>%
select(-data, -boot_res) %>%
unnest(cols = -vs) %>%
ungroup()
#> # A tibble: 2 x 7
#> vs statistic bias std.error conf.low conf.high n
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 0 16.6 -0.0115 0.843 15.0 18.3 18
#> 2 1 24.6 -0.0382 1.36 22.1 27.3 14
Created on 2020-01-22 by the reprex package (v0.3.0)
data.table
concise syntax
Note, however, that, I got to a more concise syntax by using the data.table
package instead of dplyr
:
library(data.table)
library(magrittr)
library(boot)
library(broom)
mtcars <- mtcars %>% copy %>% setDT
set.seed(321)
mtcars[, c(n = .N,
boot(data = mpg,
statistic = function(x, i) mean(x[i]),
R = 1000) %>%
tidy(conf.int = TRUE, conf.method = "perc")),
by = vs]
#> vs n statistic bias std.error conf.low conf.high
#> 1: 0 18 16.61667 -0.01149444 0.8425817 15.03917 18.26653
#> 2: 1 14 24.55714 -0.03822857 1.3633112 22.06429 27.32839
Created on 2020-01-23 by the reprex package (v0.3.0)
Multiple variables at once with data.table
library(data.table)
library(magrittr)
library(boot)
library(broom)
mtcars <- mtcars %>% copy %>% setDT
# Specify here the variables for which you want CIs
variables <- c("mpg", "disp")
# Function to get the CI stats, will be applied to each column of a subset of
# data (.SD)
get_ci <- function(varb, ...){
boot(data = varb,
statistic = function(x, i) mean(x[i]),
R = 1000) %>%
tidy(conf.int = TRUE, ...)
}
set.seed(321)
mtcars[, c(n = .N,
lapply(.SD, get_ci) %>%
rbindlist(idcol = "varb")),
by = vs, .SDcols = variables]
#> vs n varb statistic bias std.error conf.low conf.high
#> 1: 0 18 mpg 16.61667 -0.01149444 0.8425817 15.03917 18.26653
#> 2: 0 18 disp 307.15000 -1.49692222 23.1501247 261.18766 353.04416
#> 3: 1 14 mpg 24.55714 -0.03215714 1.3800432 21.86628 27.50551
#> 4: 1 14 disp 132.45714 0.32994286 14.9070552 104.45798 163.57344
Created on 2020-01-23 by the reprex package (v0.3.0)
I use the ci command from the gmodels package:
library(gmodels)
your_db %>% group_by(gouping_variable1, grouping_variable2, ...)
%>% summarise(mean = ci(variable_of_interest)[1],
lowCI = ci(variable_of_interest)[2],
hiCI = ci(variable_of_interest)[3],
sd = ci (variable_of_interest)[4])