Faster way to summarise variables based on column
na.omit does a ton of unnecessary checks and operations. Replacing it with a simple is.na call gives a decent speedup:
concat3 = function(x) {
  x = x[!is.na(x)]
  if (length(x) == 0)
    NA_character_
  else
    paste(x, collapse = ",")
}
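To see where the gain comes from, here's a small sketch (vector size and NA share are arbitrary choices) comparing na.omit() with plain logical subsetting; na.omit() additionally records the positions of the dropped values in an "na.action" attribute:
library(microbenchmark)

set.seed(1)
v <- sample(c(letters, NA), 1e5, replace = TRUE)  # ~4% NAs, size picked arbitrarily

microbenchmark(
  na.omit = na.omit(v),    # dispatches, copies, and attaches an "na.action" attribute
  subset  = v[!is.na(v)],  # plain logical subsetting, no bookkeeping
  times   = 100
)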
Using docendo's data (but with strings instead of factors - factors slow all versions down):
microbenchmark(dt[, lapply(.SD, concat3), by = proid, .SDcols = -c("X4")],
               dt[, lapply(.SD, concat2), by = proid, .SDcols = -c("X4")],
               times = 5)
#Unit: milliseconds
# expr min lq mean median uq max neval
# dt[, lapply(.SD, concat3), by = proid, .SDcols = -c("X4")] 960.2475 1079.177 1251.545 1342.684 1402.571 1473.045 5
# dt[, lapply(.SD, concat2), by = proid, .SDcols = -c("X4")] 1718.8121 1892.696 2159.148 2171.772 2470.205 2542.253 5
Edit note: removed original part of my answer that didn't address NA treatment & added a benchmark.
concat2 <- function(x) if(all(is.na(x))) NA_character_ else paste(na.omit(x), collapse = ",")
Using data.table:
setDT(df)[, lapply(.SD, concat2), by = proid, .SDcols = -c("X4")]
# proid X1 X2 X3
#1: 1 zz,cd a,s e,f
#2: 2 ff,ta g,b z,h
#3: 3 NA t e
Using dplyr:
df %>% group_by(proid) %>% summarise_each(funs(concat2), -X4)
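Note that summarise_each() and funs() have since been deprecated in dplyr; on dplyr >= 1.0 the equivalent would be roughly:
df %>%
  group_by(proid) %>%
  summarise(across(-X4, concat2))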
Benchmark: the data is smaller than in the actual use case and not fully representative, so this is just meant to give an impression of how concat2 compares to concat etc.
library(microbenchmark)
library(dplyr)
library(data.table)
N <- 1e6
x <- c(letters, LETTERS)
df <- data.frame(
  proid = sample(1e4, N, TRUE),
  X1 = sample(sample(c(x, NA), N, TRUE)),
  X2 = sample(sample(c(x, NA), N, TRUE)),
  X3 = sample(sample(c(x, NA), N, TRUE)),
  X4 = sample(sample(c(x, NA), N, TRUE))
)
dt <- as.data.table(df)
concat <- function(x){
  x <- na.omit(x)
  if(length(x) == 0){
    return(as.character(NA))
  } else {
    return(paste(x, collapse = ","))
  }
}
concat2 <- function(x) if(all(is.na(x))) NA_character_ else paste(na.omit(x), collapse = ",")
concat.dplyr <- function(){
  df %>% group_by(proid) %>% summarise_each(funs(concat), -X4)
}
concat2.dplyr <- function(){
  df %>% group_by(proid) %>% summarise_each(funs(concat2), -X4)
}
concat.data.table <- function(){
  dt[, lapply(.SD, concat), by = proid, .SDcols = -c("X4")]
}
concat2.data.table <- function(){
  dt[, lapply(.SD, concat2), by = proid, .SDcols = -c("X4")]
}
microbenchmark(concat.dplyr(),
               concat2.dplyr(),
               concat.data.table(),
               concat2.data.table(),
               unit = "relative",
               times = 10L)
Unit: relative
expr min lq median uq max neval
concat.dplyr() 1.058839 1.058342 1.083728 1.105907 1.080883 10
concat2.dplyr() 1.057991 1.065566 1.109099 1.145657 1.079201 10
concat.data.table() 1.024101 1.018443 1.093604 1.085254 1.066560 10
concat2.data.table() 1.000000 1.000000 1.000000 1.000000 1.000000 10
Findings: data.table performs a little faster than dplyr on the sample data, and concat2 is a little faster than concat. However, the differences remain rather small on this sample data set.
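A quick sanity check (a sketch, assuming the wrappers above are defined) that both functions return identical results, so the timings compare like with like:
res_concat  <- concat.data.table()
res_concat2 <- concat2.data.table()
identical(res_concat, res_concat2)
# should be TRUE: the two functions only differ in how they drop NAs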