How to name the list of the group_split output in dplyr

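Not sure if this can be done directly. One way is to store the sampled data frame and then pass its unique group values to setNames. Note that the names only line up if unique() returns the values in the same order in which group_split returns the groups.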

library(dplyr)

df <- iris %>% sample_n(size = 5) 

df %>%
   group_split(Species) %>%
   setNames(unique(df$Species))


#$setosa
# A tibble: 1 x 5
#  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#         <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#1            5         3.4          1.5         0.2 setosa 

#$versicolor
# A tibble: 1 x 5
#  Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
#         <dbl>       <dbl>        <dbl>       <dbl> <fct>     
#1            6         3.4          4.5         1.6 versicolor

#$virginica
# A tibble: 3 x 5
#  Sepal.Length Sepal.Width Petal.Length Petal.Width Species  
#         <dbl>       <dbl>        <dbl>       <dbl> <fct>    
#1          7.3         2.9          6.3         1.8 virginica
#2          6.9         3.1          5.1         2.3 virginica
#3          7.7         3            6.1         2.3 virginica

It is odd that group_split doesn't name the list elements directly, since it is meant to be an alternative to base::split, which does name them.

split(df, df$Species)

The documentation says:

group_split() works like base::split() but

  • it uses the grouping structure from group_by() and therefore is subject to the data mask
  • it does not name the elements of the list based on the grouping as this typically loses information and is confusing.
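
Since group_split() leaves the pieces unnamed, one way to avoid guessing their order is to take the names from group_keys(), which returns one row per group in the same order as the pieces. The following is a minimal sketch on the full iris data, not something the documentation prescribes; the variable names grouped and named are only illustrative.

```r
library(dplyr)

# Group first so that the keys and the pieces come from the same object.
grouped <- iris %>% group_by(Species)

# group_keys() has one row per group, in the same order as group_split().
named <- grouped %>%
  group_split() %>%
  setNames(as.character(group_keys(grouped)$Species))

names(named)
```

With more than one grouping variable there is no single obvious name per piece, which is presumably what the documentation means by losing information; in that case the key columns would have to be pasted together first.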

For the updated dataset this doesn't work, because the names come from unique(), which returns values in the order they first appear, whereas group_split() returns the groups in sorted order of the grouping values (so the pieces come out as Cluster1, Cluster11, Cluster2...). One way to overcome that is to convert Cluster to a factor and specify its levels in order of appearance using unique().

df <- df %>%
      mutate(Cluster = factor(Cluster, levels = unique(Cluster))) 

df %>%
   group_split(Cluster) %>%
   setNames(unique(df$Cluster))

Or, if you don't want them as factors, sort the names instead:

df %>%
  group_split(Cluster) %>%
  setNames(sort(unique(df$Cluster)))
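
The ordering mismatch between unique() and group_split() can be checked on a small example. This is only a sketch: the tibble toy and its values are made up for illustration and are not the dataset from the question.

```r
library(dplyr)

# Hypothetical toy data; the Cluster values are invented for this example.
toy <- tibble(
  Cluster = c("Cluster9", "Cluster1", "Cluster11", "Cluster9"),
  value   = 1:4
)

# Order of first appearance, which is what unique() returns.
appearance_order <- unique(toy$Cluster)

# Order actually used by group_split(): read the key back from each piece.
pieces <- toy %>% group_split(Cluster)
split_order <- vapply(pieces, function(p) p$Cluster[1], character(1))

appearance_order  # "Cluster9"  "Cluster1"  "Cluster11"
split_order       # "Cluster1"  "Cluster11" "Cluster9"
```

Because the two orders differ, setNames(unique(toy$Cluster)) would label the pieces incorrectly here, while reading the key from each piece (or sorting the names) keeps names and contents aligned.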

Lots of good answers. You can also just do:

iris %>% sample_n(size = 5) %>% 
  split(f = as.factor(.$Species))

Which will give you:

$setosa
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
4          5.5         3.5          1.3         0.2  setosa
5          5.3         3.7          1.5         0.2  setosa

$versicolor
  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
3            5         2.3          3.3           1 versicolor

$virginica
  Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
1          7.7         2.6          6.9         2.3 virginica
2          7.2         3.0          5.8         1.6 virginica

Also works with your dataframe above:

df %>% 
  split(f = as.factor(.$Cluster))

Gives you:

$Cluster1
# A tibble: 1 x 6
  Cluster  gene_name    p_value morans_test_statistic morans_I    q_value
  <chr>    <chr>          <dbl>                 <dbl>    <dbl>      <dbl>
1 Cluster1 Grhpr     0.00000155                  4.66   0.0261 0.00000343

$Cluster11
# A tibble: 2 x 6
  Cluster   gene_name  p_value morans_test_statistic morans_I  q_value
  <chr>     <chr>        <dbl>                 <dbl>    <dbl>    <dbl>
1 Cluster11 Vimp      3.17e-62                 16.6    0.0948 1.62e-61
2 Cluster11 Fgfr1op2  2.07e- 8                  5.48   0.0310 4.98e- 8

$Cluster12
# A tibble: 1 x 6
  Cluster   gene_name p_value morans_test_statistic morans_I q_value
  <chr>     <chr>       <dbl>                 <dbl>    <dbl>   <dbl>
1 Cluster12 Pikfyve    0.0147                  2.18   0.0120  0.0245

$Cluster6
# A tibble: 1 x 6
  Cluster  gene_name  p_value morans_test_statistic morans_I  q_value
  <chr>    <chr>        <dbl>                 <dbl>    <dbl>    <dbl>
1 Cluster6 Zfp398    0.000354                  3.39   0.0188 0.000684

$Cluster8
# A tibble: 2 x 6
  Cluster  gene_name   p_value morans_test_statistic morans_I   q_value
  <chr>    <chr>         <dbl>                 <dbl>    <dbl>     <dbl>
1 Cluster8 Golga7    4.14e-  6                  4.46   0.0251 8.96e-  6
2 Cluster8 Lars2     3.93e-184                 28.9    0.165  3.48e-183

$Cluster9
# A tibble: 3 x 6
  Cluster  gene_name   p_value morans_test_statistic morans_I   q_value
  <chr>    <chr>         <dbl>                 <dbl>    <dbl>     <dbl>
1 Cluster9 Tbc1d8    3.47e- 47                  14.4   0.0815 1.58e- 46
2 Cluster9 H1f0      9.46e-131                  24.3   0.139  7.00e-130
3 Cluster9 Ankrd13a  1.43e- 38                  12.9   0.0737 5.96e- 38
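
One difference to keep in mind when swapping group_split() for split(): with a factor, split() keeps unused levels as empty elements by default, whereas group_split() drops empty groups by default. A small sketch on iris, assuming one species is filtered out (the object names are illustrative):

```r
library(dplyr)

# Keep only two species; the factor still carries all three levels.
two_species <- iris %>% filter(Species != "versicolor")

base_pieces  <- split(two_species, two_species$Species)
dplyr_pieces <- two_species %>% group_split(Species)

names(base_pieces)            # still includes "versicolor"
nrow(base_pieces$versicolor)  # 0, an empty data frame
length(dplyr_pieces)          # 2, the empty group is dropped
```

Passing drop = TRUE to split() removes the empty elements, so both approaches return the same number of pieces.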
