Splitting a continuous variable into equal sized groups
Try this:
split(das, cut(das$anim, 3))
If you want to split based on the value of wt, then:
library(Hmisc) # cut2
split(das, cut2(das$wt, g=3))
Either way, you can do this by combining cut, cut2 and split.
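As a self-contained illustration, here is a minimal base-R sketch. It assumes das looks like the data frame printed later in this thread (the values are copied from that output). split() returns a list with one data frame per level of the factor that cut() produces.

```r
# Rebuild the example data; values are copied from the output shown in this thread
das <- data.frame(
  anim = 1:15,
  wt = c(181.0, 179.0, 180.5, 201.0, 201.5, 245.0, 246.4, 189.3,
         301.0, 354.0, 369.0, 205.0, 199.0, 394.0, 231.3)
)

# cut() with a single number creates that many equal-width intervals
groups <- split(das, cut(das$anim, 3))

length(groups)        # one list element per interval
sapply(groups, nrow)  # number of rows in each group
```

Because anim is just 1 to 15, equal-width intervals happen to give equal-sized groups here; that is not true in general for a skewed variable such as wt.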
UPDATED
If you want a group index as an additional column, then:
das$group <- cut(das$anim, 3)
If the column should be a plain integer index like 1, 2, ..., then:
das$group <- as.numeric(cut(das$anim, 3))
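As a small aside, base R's cut() can also return the integer codes directly via labels = FALSE, which avoids the round trip through a factor. A minimal sketch, using a stand-in vector for das$anim:

```r
anim <- 1:15  # stand-in for das$anim

# Factor first, then convert to numeric codes
idx_a <- as.numeric(cut(anim, 3))

# Ask cut() for integer codes directly
idx_b <- cut(anim, 3, labels = FALSE)

# Both approaches assign the same group to every element
identical(as.integer(idx_a), idx_b)
```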
UPDATED AGAIN
Try this:
> das$wt2 <- as.numeric(cut2(das$wt, g=3))
> das
   anim    wt wt2
1     1 181.0   1
2     2 179.0   1
3     3 180.5   1
4     4 201.0   2
5     5 201.5   2
6     6 245.0   2
7     7 246.4   3
8     8 189.3   1
9     9 301.0   3
10   10 354.0   3
11   11 369.0   3
12   12 205.0   2
13   13 199.0   1
14   14 394.0   3
15   15 231.3   2
If you want to split into 3 equally distributed groups, the answer is the same as Ben Bolker's answer: use ggplot2::cut_number(). For the sake of completeness, here are the 3 methods of converting continuous to categorical (binning).
cut_number(): Makes n groups with (approximately) equal numbers of observations
cut_interval(): Makes n groups with equal range
cut_width(): Makes groups of a given width (the width argument)
My go-to is cut_number() because it uses evenly spaced quantiles for binning observations. Here's an example with skewed data.
library(tidyverse)
skewed_tbl <- tibble(
    counts = c(1:100, 1:50, 1:20, rep(1:10, 3),
               rep(1:5, 5), rep(1:2, 10), rep(1, 20))
) %>%
    mutate(
        counts_cut_number   = cut_number(counts, n = 4),
        counts_cut_interval = cut_interval(counts, n = 4),
        counts_cut_width    = cut_width(counts, width = 25)
    )
# Data
skewed_tbl
#> # A tibble: 265 x 4
#>    counts counts_cut_number counts_cut_interval counts_cut_width
#>     <dbl> <fct>             <fct>               <fct>
#>  1      1 [1,3]             [1,25.8]            [-12.5,12.5]
#>  2      2 [1,3]             [1,25.8]            [-12.5,12.5]
#>  3      3 [1,3]             [1,25.8]            [-12.5,12.5]
#>  4      4 (3,13]            [1,25.8]            [-12.5,12.5]
#>  5      5 (3,13]            [1,25.8]            [-12.5,12.5]
#>  6      6 (3,13]            [1,25.8]            [-12.5,12.5]
#>  7      7 (3,13]            [1,25.8]            [-12.5,12.5]
#>  8      8 (3,13]            [1,25.8]            [-12.5,12.5]
#>  9      9 (3,13]            [1,25.8]            [-12.5,12.5]
#> 10     10 (3,13]            [1,25.8]            [-12.5,12.5]
#> # ... with 255 more rows
summary(skewed_tbl$counts)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>    1.00    3.00   13.00   25.75   42.00  100.00
# Histogram showing skew
skewed_tbl %>%
    ggplot(aes(counts)) +
    geom_histogram(bins = 30)
# cut_number() evenly distributes observations into bins by quantile
skewed_tbl %>%
    ggplot(aes(counts_cut_number)) +
    geom_bar()
# cut_interval() evenly splits the interval across the range
skewed_tbl %>%
    ggplot(aes(counts_cut_interval)) +
    geom_bar()
# cut_width() uses width = 25 to create bins that are 25 wide
skewed_tbl %>%
    ggplot(aes(counts_cut_width)) +
    geom_bar()
Created on 2018-11-01 by the reprex package (v0.2.1)
Here's another solution using the bin_data() function from the mltools package.
library(mltools)
# Resulting bins have an equal number of observations in each group
das[, "wt2"] <- bin_data(das$wt, bins=3, binType = "quantile")
# Resulting bins are equally spaced from min to max
das[, "wt3"] <- bin_data(das$wt, bins=3, binType = "explicit")
# Or if you'd rather define the bins yourself
das[, "wt4"] <- bin_data(das$wt, bins=c(-Inf, 250, 322, Inf), binType = "explicit")
das
   anim    wt                                  wt2                                  wt3         wt4
1     1 181.0              [179, 200.333333333333)              [179, 250.666666666667) [-Inf, 250)
2     2 179.0              [179, 200.333333333333)              [179, 250.666666666667) [-Inf, 250)
3     3 180.5              [179, 200.333333333333)              [179, 250.666666666667) [-Inf, 250)
4     4 201.0 [200.333333333333, 245.466666666667)              [179, 250.666666666667) [-Inf, 250)
5     5 201.5 [200.333333333333, 245.466666666667)              [179, 250.666666666667) [-Inf, 250)
6     6 245.0 [200.333333333333, 245.466666666667)              [179, 250.666666666667) [-Inf, 250)
7     7 246.4              [245.466666666667, 394]              [179, 250.666666666667) [-Inf, 250)
8     8 189.3              [179, 200.333333333333)              [179, 250.666666666667) [-Inf, 250)
9     9 301.0              [245.466666666667, 394] [250.666666666667, 322.333333333333)  [250, 322)
10   10 354.0              [245.466666666667, 394]              [322.333333333333, 394]  [322, Inf]
11   11 369.0              [245.466666666667, 394]              [322.333333333333, 394]  [322, Inf]
12   12 205.0 [200.333333333333, 245.466666666667)              [179, 250.666666666667) [-Inf, 250)
13   13 199.0              [179, 200.333333333333)              [179, 250.666666666667) [-Inf, 250)
14   14 394.0              [245.466666666667, 394]              [322.333333333333, 394]  [322, Inf]
15   15 231.3 [200.333333333333, 245.466666666667)              [179, 250.666666666667) [-Inf, 250)
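If you would rather not add a package for the hand-picked bins, a rough base-R analogue (a sketch, not what mltools does internally) is cut() with explicit breaks. Setting right = FALSE makes the intervals closed on the left, like the wt4 column above. The wt vector is copied from the output shown above.

```r
wt <- c(181.0, 179.0, 180.5, 201.0, 201.5, 245.0, 246.4, 189.3,
        301.0, 354.0, 369.0, 205.0, 199.0, 394.0, 231.3)

# Left-closed bins with the same cut points used for wt4
wt_manual <- cut(wt, breaks = c(-Inf, 250, 322, Inf), right = FALSE)

table(wt_manual)  # observations per hand-picked bin
```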
Or see cut_number from the ggplot2 package, e.g.
das$wt_2 <- as.numeric(cut_number(das$wt,3))
Note that cut(..., 3) divides the range of the original data into three ranges of equal length; it doesn't necessarily result in the same number of observations per group if the data are unevenly distributed (you can replicate what cut_number does by using quantile appropriately, but it's a nice convenience function). On the other hand, Hmisc::cut2() with the g= argument does split by quantiles, so it is more or less equivalent to ggplot2::cut_number. I might have thought that something like cut_number would have made its way into dplyr by now, but as far as I can tell it hasn't.
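For reference, one way to read "using quantile appropriately" in base R is to pass quantile-based breaks to cut(). This is a minimal sketch rather than a reimplementation of cut_number(); the wt values are copied from the example data in this thread.

```r
wt <- c(181.0, 179.0, 180.5, 201.0, 201.5, 245.0, 246.4, 189.3,
        301.0, 354.0, 369.0, 205.0, 199.0, 394.0, 231.3)

# Equal-width ranges: group sizes depend on how the data are distributed
table(cut(wt, 3))

# Quantile breaks (0%, 33.3%, 66.7%, 100%) give roughly equal counts per group
brks <- quantile(wt, probs = seq(0, 1, length.out = 4))
wt_q <- cut(wt, breaks = brks, include.lowest = TRUE)

table(wt_q)        # observations per quantile group
as.numeric(wt_q)   # integer group index 1, 2, 3
```

include.lowest = TRUE is needed so the minimum value lands in the first group instead of becoming NA. With heavily tied data the quantile breaks can coincide, in which case cut() will complain that the breaks are not unique.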