How to find top n% of records in a column of a dataframe using R

For the top 5%, sort by V2 in descending order and keep the first 5% of the rows:

head(data[order(data$V2, decreasing = TRUE), ], 0.05 * nrow(data))
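If the percentage needs to vary, the same idea can be wrapped in a small function. This is only a sketch: the name top_pct and the example data are made up for illustration, and rounding up with ceiling() is one choice among several (floor() or round() would also be defensible).

```r
# Hypothetical helper: top `pct` percent of the rows of `df`, ranked by column `col`
top_pct <- function(df, col, pct) {
  # ceiling() keeps at least one row for any positive percentage
  k <- ceiling(nrow(df) * pct / 100)
  ordered <- df[order(df[[col]], decreasing = TRUE), , drop = FALSE]
  head(ordered, k)
}

# Invented example data: 40 rows, V2 holds the values 1..40 in shuffled order
set.seed(1)
data <- data.frame(V1 = letters[rep(1:4, 10)], V2 = sample(1:40))

# 5% of 40 rows is 2 rows: the ones with V2 == 40 and V2 == 39
top_pct(data, "V2", 5)
```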

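Another solution is to use sqldf and let the query sort the data by V1 (write ORDER BY V1 DESC if the largest values should come first). The CAST keeps the LIMIT value an integer when 5% of the row count is not a whole number:

library(sqldf)
sqldf('SELECT * FROM df
       ORDER BY V1
       LIMIT (SELECT CAST(0.05 * COUNT(*) AS INTEGER) FROM df)
      ')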

You can change the rate from 0.05 (5%) to any required rate.

For the top 5%, using a quantile as the cutoff:

n <- 5
data[data$V2 > quantile(data$V2, probs = 1 - n/100), ]
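One thing to watch with a quantile cutoff is ties: a value filter does not guarantee exactly n% of the rows. The sketch below uses invented data in which the ten largest values are all equal, so the strict comparison keeps nothing, a non-strict one keeps all ten tied rows, and only the sort-and-head approach returns exactly 5% of the rows.

```r
# Invented data: 100 rows, the ten largest values are tied at 100
data <- data.frame(V2 = c(1:90, rep(100, 10)))
n <- 5

# 95th percentile; it lands on the tied value
cutoff <- quantile(data$V2, probs = 1 - n / 100)

strict    <- data[data$V2 >  cutoff, , drop = FALSE]  # tied rows are excluded
inclusive <- data[data$V2 >= cutoff, , drop = FALSE]  # every tied row is included

# Sorting and taking a fixed number of rows always gives the same count
by_order <- head(data[order(data$V2, decreasing = TRUE), , drop = FALSE],
                 ceiling(nrow(data) * n / 100))

c(strict = nrow(strict), inclusive = nrow(inclusive), by_order = nrow(by_order))
```

Which behaviour is right depends on whether "top 5%" means a fixed number of rows or every row at or above a threshold.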

A dplyr solution could look like this (note the <=, so that 5% of 100 rows keeps 5 rows rather than 4):

library(dplyr)
obs <- nrow(data)
data %>% filter(row_number() <= obs * 0.05)

This only works if the data is already sorted, which your question and example data imply. If the data is unsorted, you will first need to arrange it by the variable you're interested in:

data <- data %>% arrange(desc(V2))
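With dplyr 1.0.0 or later, the sorting and filtering steps can probably be combined using slice_max(), which takes a prop argument. The sketch below assumes that version of dplyr is installed; the data is invented, and with_ties = FALSE is set so that tied values cannot push the result past 5% of the rows.

```r
library(dplyr)

# Invented example data: 100 rows with V2 = 1..100 in shuffled order
set.seed(42)
data <- data.frame(V1 = seq_len(100), V2 = sample(1:100))

# Keep the 5% of rows with the largest V2; no prior arrange() is needed
top <- data %>% slice_max(V2, prop = 0.05, with_ties = FALSE)
top
```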
