Randomly sample groups

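I think this approach makes the most sense if you are using dplyr (note that nest() comes from tidyr):

library(dplyr)
library(tidyr)

# Collapse each species into a single row holding a nested tibble
iris_grouped <- iris %>% 
  group_by(Species) %>% 
  nest()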

Which produces:

# A tibble: 3 x 2
  Species    data             
  <fct>      <list>           
1 setosa     <tibble [50 × 4]>
2 versicolor <tibble [50 × 4]>
3 virginica  <tibble [50 × 4]>

You can then use sample_n on the nested data frame, where each row is one group:

iris_grouped %>%
  sample_n(2)

# A tibble: 2 x 2
  Species    data             
  <fct>      <list>           
1 virginica  <tibble [50 × 4]>
2 versicolor <tibble [50 × 4]>
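As a hedged sketch of the same idea end to end: the version below assumes dplyr >= 1.0.0, where slice_sample() supersedes sample_n(). It also assumes a tidyr version in which nest() on a grouped data frame keeps the grouping, which is why ungroup() is called before sampling; if your version already returns an ungrouped result, the ungroup() is harmless. The set.seed() call and the name sampled are only for illustration.

```r
library(dplyr)
library(tidyr)

set.seed(42)  # only so the illustration is repeatable

sampled <- iris %>%
  group_by(Species) %>%
  nest() %>%
  ungroup() %>%            # sample across the whole table, not within each group
  slice_sample(n = 2) %>%  # pick 2 of the 3 nested rows (one row per species)
  unnest(data)             # expand the chosen groups back into ordinary rows

nrow(sampled)                   # 2 species x 50 rows each
length(unique(sampled$Species)) # number of groups kept
```

The unnest() step is what turns the sampled groups back into a regular data frame with all of their original rows.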

Just use sample() to choose some number of groups, then filter on them:

iris %>% filter(Species %in% sample(levels(Species), 2))
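One caveat with the one-liner: levels() returns every level of a factor, including levels that no longer have any rows, and it returns NULL for a character column. A minimal variant, assuming you would rather sample from the group labels actually present, draws from unique() instead. The names keep and iris_sub are made up for this sketch, and set.seed() is only there to make the draw repeatable.

```r
library(dplyr)

set.seed(1)  # for a repeatable draw

# Draw the group labels once, from the values actually observed
keep <- sample(unique(as.character(iris$Species)), 2)

# Keep only the rows that belong to the sampled groups
iris_sub <- iris %>% filter(Species %in% keep)

table(droplevels(iris_sub$Species))
```

Storing the sampled labels in a variable also makes it easy to check afterwards which groups were picked.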

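Take note that on a small data set like iris, the dplyr version is considerably slower than the equivalent base R subsetting, roughly seven times slower by median in this benchmark:

library(dplyr)
library(microbenchmark)

# Compare the dplyr filter with plain data frame indexing
microbenchmark(
  dplyr = iris %>% filter(Species %in% sample(levels(Species), 2)),
  base  = iris[iris[["Species"]] %in% sample(levels(iris[["Species"]]), 2), ]
)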

Unit: microseconds
  expr     min      lq     mean  median       uq      max neval cld
 dplyr 660.287 710.655 753.6704 722.629 771.2860 1122.527   100   b
  base  83.629  95.032 110.0936 106.057 119.1715  199.949   100  a

Note that [[ is known to be faster than $, although both work.
