Find the most frequent value by row

Something like:

# Drop the id column so that an id can never be returned as the mode
apply(df[-1], 1, function(x) names(which.max(table(x))))
[1] "red"    "yellow" "green"

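If there is a tie, which.max returns the first maximum. Because table() sorts its names, that is the tied value that sorts first, not the one that appears first in the row. From the which.max help page: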

Determines the location, i.e., index of the (first) minimum or maximum of a numeric vector.

Example:

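# id, var1, var2 and var3 are reconstructed from the printed data frame below
id <- 1:3
var1 <- c("red","yellow","green")
var2 <- c("red","yellow","green")
var3 <- c("yellow","orange","green")
var4 <- c("yellow","green","yellow")
df <- data.frame(cbind(id, var1, var2, var3, var4))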

> df
  id   var1   var2   var3   var4
1  1    red    red yellow yellow
2  2 yellow yellow orange  green
3  3  green  green  green yellow

# Drop the id column so that an id can never be returned as the mode
apply(df[-1], 1, function(x) names(which.max(table(x))))
[1] "red"    "yellow" "green"
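
To make the tie behaviour concrete, here is a minimal sketch on a made-up tied vector (the vector `x` and the variable names are illustrative, not from the question). It shows that the table-based approach picks the tied value that sorts first, and that fixing the factor levels to the order of appearance is one possible way to prefer the value seen first instead.

```r
# Hypothetical tied row: both values occur twice, "yellow" appears first
x <- c("yellow", "red", "yellow", "red")

# table() orders its names by sorted value, so "red" precedes "yellow"
counts <- table(x)
mode_sorted <- names(which.max(counts))

# Fixing the factor levels to the order of appearance changes which tie wins
counts_seen <- table(factor(x, levels = unique(x)))
mode_seen <- names(which.max(counts_seen))

print(mode_sorted)  # [1] "red"
print(mode_seen)    # [1] "yellow"
```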

If your data is quite big you might want to consider using the data.table package.

# Generate the data
nrow <- 10^5
id <- 1:nrow
colors <- c("red","yellow","green")
var1 <- sample(colors, nrow, replace = TRUE)
var2 <- sample(colors, nrow, replace = TRUE)
var3 <- sample(colors, nrow, replace = TRUE)
var4 <- sample(colors, nrow, replace = TRUE)

Mode <- function(x) {
    ux <- unique(x)
    ux[which.max(tabulate(match(x, ux)))]
}
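
As a rough illustration of what Mode does internally, the sketch below runs its three steps one at a time on a small made-up vector (the input `x` is an assumption chosen so that two values tie). Because unique() keeps values in order of appearance, a tie goes to the value seen first.

```r
# Hypothetical input: "green" and "red" both occur twice
x <- c("green", "red", "red", "green", "yellow")

ux <- unique(x)               # distinct values, in order of appearance
idx <- match(x, ux)           # position of each element within ux
counts <- tabulate(idx)       # how often each distinct value occurs
result <- ux[which.max(counts)]  # first value with the highest count

print(ux)      # [1] "green"  "red"    "yellow"
print(idx)     # [1] 1 2 2 1 3
print(counts)  # [1] 2 2 1
print(result)  # [1] "green"
```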

Chargaff's solution is simple and works well in some cases. You can gain a small performance improvement (~20%) using data.table.

df <- data.frame(cbind(id, var1, var2, var3, var4))
system.time(apply(df, 1, Mode))
#   user  system elapsed
#  1.242   0.018   1.264

library(data.table)
dt <- data.table(cbind(id, var1, var2, var3, var4))
# Melt every non-id column; the melted column is named "value"
system.time(melt(dt, id.vars = "id")[, Mode(value), by = id])
#   user  system elapsed
#  1.020   0.012   1.034

For an internal package I wrote a rowMode function that lets you choose how to handle ties and missing values:

rowMode <- function(x, ties = NULL, include.na = FALSE) {
  # input checks data
  if ( !(is.matrix(x) || is.data.frame(x)) ) {
    stop("Your data is not a matrix or a data.frame.")
  }
  # input checks ties method
  if ( !is.null(ties) && !(ties %in% c("random", "first", "last")) ) {
    stop("Your ties method is not one of 'random', 'first' or 'last'.")
  }
  # set ties method to 'random' if not specified
  if ( is.null(ties) ) ties <- "random"
  
  # create row frequency table
  rft <- table(c(row(x)), unlist(x), useNA = c("no","ifany")[1L + include.na])
  
  # get the mode for each row
  colnames(rft)[max.col(rft, ties.method = ties)]
}

Several possible outputs, depending on the parameter options (the first call uses the default ties = "random", so its result can differ between runs):

> rowMode(DF[,-1])
 [1] "B" "E" "B" "E" "B" "C" "B" "E" "A" "E"
> rowMode(DF[,-1], ties = "first")
 [1] "B" "B" "B" "A" "B" "C" "B" "E" "A" "E"
> rowMode(DF[,-1], ties = "first", include.na = TRUE)
 [1] "B" NA  "B" NA  "B" "C" "B" "E" "A" "E"
> rowMode(DF[,-1], ties = "last", include.na = TRUE)
 [1] "B" NA  NA  NA  "B" "C" "B" "E" "D" "E"
> rowMode(DF[,-1], ties = "last")
 [1] "B" "C" "B" "E" "B" "C" "B" "E" "D" "E"

Data used:

set.seed(2020)
DF <- data.frame(id = 1:10, matrix(sample(c(LETTERS[1:5], NA_character_), 60, TRUE), ncol = 6))
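
Since the sampled data above makes the tie handling hard to check by eye, here is a small sketch of the same core idea (a row-by-value frequency table passed to max.col) on a hand-made matrix. The matrix `m` and the variable names are assumptions for illustration; only the deterministic "first" and "last" tie methods are shown.

```r
# Hypothetical 3 x 3 matrix: row 1 has mode "B", row 2 has mode "A",
# row 3 is a three-way tie between "A", "B" and "C"
m <- matrix(c("A", "B", "B",
              "A", "A", "B",
              "C", "B", "A"), nrow = 3, byrow = TRUE)

# Frequency of each value within each row (rows of the table = rows of m)
rft <- table(c(row(m)), c(m))

# Column with the highest count per row, with explicit tie-breaking
mode_first <- colnames(rft)[max.col(rft, ties.method = "first")]
mode_last  <- colnames(rft)[max.col(rft, ties.method = "last")]

print(mode_first)  # [1] "B" "A" "A"
print(mode_last)   # [1] "B" "A" "C"
```

Only the tied third row changes between the two methods, which is the behaviour the ties argument of rowMode exposes.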
