How to find top n% of records in a column of a dataframe using R
For the top 5%, you can also sort by the column and keep the first rows:
head(data[order(data$V2, decreasing = TRUE), ], 0.05 * nrow(data))
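To make the sort-then-take-the-head approach easier to reuse, here is a minimal sketch wrapped in a function. The helper name `top_pct`, its arguments, and the example data are assumptions made for illustration; they do not come from the question. The row count is rounded up with `ceiling()` so that the number of rows requested is always a whole number.

```r
# Hypothetical helper: return the top `pct` percent of rows of `df`,
# ranked by the numeric column named `col` (largest values first).
top_pct <- function(df, col, pct = 5) {
  k <- ceiling(nrow(df) * pct / 100)   # whole number of rows, rounded up
  ranked <- df[order(df[[col]], decreasing = TRUE), , drop = FALSE]
  head(ranked, k)
}

# Made-up example data: 40 rows, V2 holds the values 1..40 in shuffled order
set.seed(1)
data <- data.frame(V1 = seq_len(40), V2 = sample(40))

top <- top_pct(data, "V2", pct = 5)    # 5% of 40 rows = 2 rows
top
```

Changing `pct` gives any other percentage, for example `top_pct(data, "V2", pct = 10)`.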
Another solution is to use sqldf, if the data is sorted based on the V1 value:
library(sqldf)
sqldf('SELECT * FROM df
ORDER BY V1
LIMIT (SELECT CAST(0.05 * COUNT(*) AS INTEGER) FROM df)
')
You can change the rate from 0.05 (5%) to any required rate.
For the top 5%:
n <- 5
data[data$V2 > quantile(data$V2, probs = 1 - n / 100), ]
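The quantile approach selects rows by a value cutoff rather than by a row count. A minimal sketch of it as a function follows; the helper name `top_by_quantile` and the example data are assumptions made for illustration.

```r
# Hypothetical helper: keep rows whose `col` value lies strictly above
# the (1 - pct/100) quantile of that column.
top_by_quantile <- function(df, col, pct = 5) {
  cutoff <- quantile(df[[col]], probs = 1 - pct / 100, names = FALSE)
  df[df[[col]] > cutoff, , drop = FALSE]
}

# Made-up example data: 100 rows, V2 holds the values 1..100 in shuffled order
set.seed(1)
data <- data.frame(V1 = seq_len(100), V2 = sample(100))

top <- top_by_quantile(data, "V2", pct = 5)

# The selected rows keep their original order, so sort them for display
top[order(top$V2, decreasing = TRUE), ]
```

Because the selection depends on a cutoff value, tied values at the cutoff can make the result contain more or fewer rows than exactly n% of the data. If an exact row count matters, a rank-based approach is probably the safer choice.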
A dplyr solution could look like this:
library(dplyr)
obs <- nrow(data)
data %>% filter(row_number() <= obs * 0.05)
This only works if the data is already sorted, which your question and example data imply. If the data is unsorted, you will need to arrange it by the variable you're interested in first:
data <- data %>% arrange(desc(V2))
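Putting the sorting and filtering steps together, a minimal end-to-end sketch might look like the following. The example data and the variable names `pct`, `top_sorted` and `top_sliced` are assumptions for illustration. The last statement uses `slice_max()`, which is not part of the answer above; it is an alternative available in dplyr 1.0.0 and later that sorts and keeps a proportion of rows in one step.

```r
library(dplyr)

# Made-up example data: 100 rows, V2 holds the values 1..100 in shuffled order
set.seed(42)
data <- data.frame(V1 = seq_len(100), V2 = sample(100))

pct <- 5  # percentage of rows to keep

# Sort by V2 (largest first), then keep the first pct% of rows
top_sorted <- data %>%
  arrange(desc(V2)) %>%
  filter(row_number() <= ceiling(n() * pct / 100))

# Alternative in one step; requires dplyr >= 1.0.0
top_sliced <- data %>% slice_max(V2, prop = pct / 100)
```

Note that `slice_max()` keeps tied values by default (`with_ties = TRUE`), so with ties it can return more rows than the requested proportion.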