Sample n random rows per group in a dataframe
Here's a solution. We split a data.frame into color groups. Then we sample 3 rows from each group. This yields a list of data.frames.
df2 <- lapply(split(df, df$color),
  # draw 3 random row indices within each color group
  function(subdf) subdf[sample(seq_len(nrow(subdf)), 3), ]
)
To obtain the desired result, we merge the list of data.frames into 1 data.frame:
do.call('rbind', df2)
## X1 X2 color
## blue.3 -1.22677188 1.25648082 blue
## blue.4 -0.54516686 -1.94342967 blue
## blue.1 0.44647071 0.16283326 blue
## pink.40 0.23520296 -0.40411906 pink
## pink.34 0.02033939 -0.32321309 pink
## pink.33 -1.01790533 -1.22618575 pink
## red.16 1.86545895 1.11691250 red
## red.11 1.35748078 -0.36044728 red
## red.13 -0.02425645 0.85335279 red
## yellow.21 1.96728782 -1.81388110 yellow
## yellow.25 -0.48084967 0.07865186 yellow
## yellow.24 -0.07056236 -0.28514125 yellow
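If you need this more than once, the split/sample/rbind steps can be wrapped in a small helper. This is only a minimal sketch: sample_per_group is a made-up name, the example df is invented to mirror the columns shown above, and capping the sample at the group size (so that small groups don't throw an error) is my own assumption about what you'd want.

```r
# Hypothetical example data with the same columns as the output above.
set.seed(42)
df <- data.frame(
  X1 = rnorm(40),
  X2 = rnorm(40),
  color = rep(c("blue", "red", "yellow", "pink"), each = 10)
)

# sample_per_group is an illustrative name, not part of any package.
sample_per_group <- function(data, group_col, n) {
  pieces <- lapply(split(data, data[[group_col]]), function(subdf) {
    # Cap at the group size so groups smaller than n do not raise an error.
    k <- min(n, nrow(subdf))
    subdf[sample.int(nrow(subdf), k), , drop = FALSE]
  })
  # Stack the per-group samples back into one data.frame.
  do.call(rbind, pieces)
}

result <- sample_per_group(df, "color", 3)
table(result$color)
```

Asking for more rows than a group holds (say n = 20 here) simply returns the whole group instead of failing.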
I would consider my stratified function, which is presently hosted as a GitHub Gist.
Get it with:
library(devtools) ## To download "stratified"
source_gist("https://gist.github.com/mrdwab/6424112")
And use it with:
stratified(df, "color", 3)
The function has several features that are convenient for stratified sampling. For instance, you can restrict the sample to particular groups "on the fly":
stratified(df, "color", 3, select = list(color = c("blue", "red")))
To give you a sense of what the function does, here are the arguments to stratified:
- df: The input data.frame.
- group: A character vector of the column or columns that make up the "strata".
- size: The desired sample size.
  - If size is a value less than 1, a proportionate sample is taken from each stratum.
  - If size is a single integer of 1 or more, that number of samples is taken from each stratum.
  - If size is a vector of integers, the specified number of samples is taken for each stratum. It is recommended that you use a named vector. For example, if you have two strata, "A" and "B", and you wanted 5 samples from "A" and 10 from "B", you would enter size = c(A = 5, B = 10).
- select: This allows you to subset the groups in the sampling process. This is a list. For instance, if your group variable was "Group", and it contained three strata, "A", "B", and "C", but you only wanted to sample from "A" and "C", you can use select = list(Group = c("A", "C")).
- replace: For sampling with replacement.
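To make the three meanings of size concrete, here is a rough base R sketch. It is not the code from the gist: sample_by_size and df_ab are made-up names, it handles only a single grouping column, and rounding the proportionate size with round() is my assumption.

```r
# Illustrative only: this is NOT the implementation of the stratified gist.
# It mimics the documented meaning of `size` for one grouping column.
sample_by_size <- function(data, group_col, size, replace = FALSE) {
  groups <- split(data, data[[group_col]], drop = TRUE)
  pieces <- lapply(names(groups), function(g) {
    subdf <- groups[[g]]
    n <- if (length(size) > 1) {
      size[[g]]                     # named vector: one size per stratum
    } else if (size < 1) {
      round(nrow(subdf) * size)     # proportion of the stratum (assumed rounding)
    } else {
      size                          # same count for every stratum
    }
    subdf[sample.int(nrow(subdf), n, replace = replace), , drop = FALSE]
  })
  do.call(rbind, pieces)
}

# Hypothetical data: 20 rows in stratum "A", 30 rows in stratum "B".
df_ab <- data.frame(Group = rep(c("A", "B"), times = c(20, 30)),
                    value = seq_len(50))

prop  <- sample_by_size(df_ab, "Group", 0.1)               # 10% of each stratum
fixed <- sample_by_size(df_ab, "Group", 5)                 # 5 rows per stratum
named <- sample_by_size(df_ab, "Group", c(A = 5, B = 10))  # per-stratum sizes
```

With a named vector the sizes are looked up by stratum name, which is why naming the vector is the safer choice: the result doesn't depend on the order in which the strata happen to appear.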
In versions of dplyr 0.3 and later, this works just fine:
df %>% group_by(color) %>% sample_n(size = 3)
Old versions of dplyr (version <= 0.2)
I set out to answer this using dplyr, assuming that this would work:
df %.% group_by(color) %.% sample_n(size = 3)
But it turns out that in 0.2 the sample_n.grouped_df S3 method exists but isn't registered in the NAMESPACE file, so it's never dispatched. Instead, I had to do this:
df %.% group_by(color) %.% dplyr:::sample_n.grouped_df(size = 3)
Source: local data frame [12 x 3]
Groups: color
X1 X2 color
8 0.66152710 -0.7767473 blue
1 -0.70293752 -0.2372700 blue
2 -0.46691793 -0.4382669 blue
32 -0.47547565 -1.0179842 pink
31 -0.15254540 -0.6149726 pink
39 0.08135292 -0.2141423 pink
15 0.47721644 -1.5033192 red
16 1.26160230 1.1202527 red
12 -2.18431919 0.2370912 red
24 0.10493757 1.4065835 yellow
21 -0.03950873 -1.1582658 yellow
28 -2.15872261 -1.5499822 yellow
Presumably this will be fixed in a future update.
You can assign a random ID to each element that has a particular factor level using ave. Then you can select all random IDs in a certain range.
rndid <- with(df, ave(X1, color, FUN=function(x) {sample.int(length(x))}))
df[rndid<=3,]
This has the advantage of preserving the original row order and row names if that's something you are interested in. Plus you can re-use the rndid vector to create subsets of different sizes fairly easily.
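As a quick illustration of re-using rndid, here is a small sketch. The df below is invented example data with the same column names, and three and five are just illustrative variable names.

```r
# Hypothetical example data with the same columns as in the question.
set.seed(1)
df <- data.frame(
  X1 = rnorm(40),
  X2 = rnorm(40),
  color = rep(c("blue", "red", "yellow", "pink"), each = 10)
)

# Random rank 1..k within each color group, aligned with the rows of df.
rndid <- with(df, ave(X1, color, FUN = function(x) sample.int(length(x))))

three <- df[rndid <= 3, ]  # 3 random rows per color
five  <- df[rndid <= 5, ]  # 5 random rows per color, same rndid re-used
```

Because both subsets come from the same rndid, the rows in three are also contained in five, and both keep the row order of df.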