Output a numeric value from 'cut()' in R
Much of the behavior of cut is related to creating the labels that you're not interested in. You're probably better off using findInterval or .bincode.
You would start with the data:
set.seed(17)
df <- data.frame(x=300 * runif(100))
Then set the breaks and find the intervals:
breaks <- c(0,25,75,125,175,225,299)
df$interval <- findInterval(df$x, breaks)
df$start <- breaks[df$interval]
df$end <- breaks[df$interval + 1]
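The .bincode alternative mentioned above is not shown, so here is a minimal sketch of it. The input vector is hand-picked for illustration (an assumption, not the data from the question). It also shows one difference worth knowing: for a value beyond the last break, findInterval returns length(breaks), so breaks[interval + 1] is NA, whereas .bincode returns NA for the bin itself.

```r
# Hand-picked values for illustration; 299.5 lies beyond the last break.
x <- c(10, 50, 100, 299.5)
breaks <- c(0, 25, 75, 125, 175, 225, 299)

# .bincode follows the same convention as cut(): (lower, upper] when right = TRUE.
bin <- .bincode(x, breaks, right = TRUE, include.lowest = TRUE)

# findInterval uses [lower, upper) by default and returns length(breaks)
# for values at or above the last break.
fi <- findInterval(x, breaks)

# Side-by-side view: start/end come from indexing breaks with the bin number.
result <- data.frame(x, bin, start = breaks[bin], end = breaks[bin + 1],
                     findInterval = fi)
print(result)
```

Note that the two functions disagree on which side of each interval is closed, so values that fall exactly on a break can land in different bins.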
I'm guessing at what you want, since if you wanted the "original numbers", you could just use df$x. I presume you are after some number to reflect the group? If so, what about the following?
## Generate some example data
x = runif(5, 0, 300)
## Specify the labels
labels = c(0,25,75,125,175,225)
## Use cut as before
y = cut(x,
        breaks = c(0,25,75,125,175,225,300),
        labels = labels,
        right = TRUE)
When we convert y to a numeric, this gives the index of the label. Hence,
labels[as.numeric(y)]
or simpler
labels[y]
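To make the result concrete, here is a small sketch with fixed input values (chosen for illustration, not taken from the question) so the output is reproducible. The third variant, converting the factor through character, is an additional idiom that is not part of the answer above; it works here only because the labels are themselves numbers.

```r
# Fixed inputs so the result does not depend on the random seed.
x <- c(10, 50, 100, 200, 250)
labels <- c(0, 25, 75, 125, 175, 225)
y <- cut(x,
         breaks = c(0, 25, 75, 125, 175, 225, 300),
         labels = labels,
         right = TRUE)

# Index of the level, used to look up the numeric label.
by_index <- labels[as.numeric(y)]

# Indexing with a factor uses its integer codes, so this is equivalent.
by_factor <- labels[y]

# Alternative: read the level text back as a number.
by_text <- as.numeric(as.character(y))

print(by_index)
```

All three vectors hold the lower bound of the interval each value falls into.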
I would use a regex, since all the information is in the output of cut.
cut_borders <- function(x){
  pattern <- "(\\(|\\[)(-*[0-9]+\\.*[0-9]*),(-*[0-9]+\\.*[0-9]*)(\\)|\\])"
  start <- as.numeric(gsub(pattern,"\\2", x))
  end <- as.numeric(gsub(pattern,"\\3", x))
  data.frame(start, end)
}
The pattern in words:
Group 1: either a ( or a [, so we use (\\(|\\[).
Group 2: the number might be negative, so we allow a minus sign (-*); we are looking for at least one digit ([0-9]+), which can be followed by a decimal point (\\.*) and decimals after the point ([0-9]*).
Next there is a comma (,).
Group 3: same as group 2.
Group 4: analogous to group 1, we expect either a ) or a ].
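As a quick check of the groups described above, the pattern can be applied to a couple of hand-written interval labels (the label strings are made up for illustration; they follow the format that cut produces):

```r
pattern <- "(\\(|\\[)(-*[0-9]+\\.*[0-9]*),(-*[0-9]+\\.*[0-9]*)(\\)|\\])"

# Two example labels: one with a negative decimal bound, one closed on both sides.
labs <- c("(-1.5,2]", "[0,3]")

# Backreference \\2 keeps only the lower bound, \\3 only the upper bound.
start <- as.numeric(gsub(pattern, "\\2", labs))
end <- as.numeric(gsub(pattern, "\\3", labs))

print(data.frame(labs, start, end))
```

The whole label is matched and replaced by a single group, which is why gsub returns just the number.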
Here is a random variable cut at its quantiles. The function cut_borders returns what we are looking for:
x <- rnorm(10)
x_groups <- cut(x, quantile(x, 0:4/4), include.lowest = TRUE)
cut_borders(x_groups)
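One caveat with parsing the labels: cut formats them with dig.lab significant digits (3 by default), so the recovered borders are rounded versions of the breaks. If the breaks vector is still available, a possible alternative is to index it with the integer codes of the factor. This is only a sketch; the seed and the variable names idx and borders are assumptions for illustration.

```r
set.seed(1)  # assumed seed, only to make the sketch reproducible
x <- rnorm(10)
breaks <- quantile(x, 0:4 / 4)
x_groups <- cut(x, breaks, include.lowest = TRUE)

# The integer code of each factor level is the position of its interval.
idx <- as.integer(x_groups)

# Look up the unrounded borders directly in the breaks vector.
borders <- data.frame(start = unname(breaks)[idx],
                      end = unname(breaks)[idx + 1])
print(borders)
```

This keeps full precision but only works when you have the breaks at hand; the regex approach needs nothing but the factor itself.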