Calculating percentile of dataset column
If you order a vector x
, and find the values that is half way through the vector, you just found a median, or 50th percentile. Same logic applies for any percentage. Here are two examples.
x <- rnorm(100)
quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1)) # quartile
quantile(x, probs = seq(0, 1, by= 0.1)) # decile
table_ages <- subset(infert, select=c("age"))
summary(table_ages)
# age
# Min. :21.00
# 1st Qu.:28.00
# Median :31.00
# Mean :31.50
# 3rd Qu.:35.25
# Max. :44.00
This is probably what they're looking for. summary(...)
applied to a numeric returns the min, max, mean, median, and 25th and 75th percentile of the data.
Note that
summary(infert$age)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 21.00 28.00 31.00 31.50 35.25 44.00
The numbers are the same but the format is different. This is because table_ages
is a data frame with one column (ages), whereas infert$age
is a numeric vector. Try typing summary(infert)
.
Using {dplyr}:
library(dplyr)
# percentiles
infert %>%
mutate(PCT = ntile(age, 100))
# quartiles
infert %>%
mutate(PCT = ntile(age, 4))
# deciles
infert %>%
mutate(PCT = ntile(age, 10))
The quantile()
function will do much of what you probably want, but since the question was ambiguous, I will provide an alternate answer that does something slightly different from quantile()
.
ecdf(infert$age)(infert$age)
will generate a vector of the same length as infert$age
giving the proportion of infert$age
that is below each observation. You can read the ecdf
documentation, but the basic idea is that ecdf()
will give you a function that returns the empirical cumulative distribution. Thus ecdf(X)(Y)
is the value of the cumulative distribution of X at the points in Y. If you wanted to know just the probability of being below 30 (thus what percentile 30 is in the sample), you could say
ecdf(infert$age)(30)
The main difference between this approach and using the quantile()
function is that quantile()
requires that you put in the probabilities to get out the levels, and this requires that you put in the levels to get out the probabilities.