Spreading a two column data frame with tidyr
Another base answer (which also looks fast):
data.frame(split(df$b,df$a))
While I'm aware you're after tidyr, base has a solution in this case:
unstack(df, b~a)
It's also roughly twice as fast:
Unit: microseconds
                expr     min      lq     mean   median       uq      max neval
 df %>% spread(a, b) 657.699 679.508 717.7725 690.484  724.9795 1648.381   100
  unstack(df, b ~ a) 309.891 335.264 349.4812 341.9635 351.6565  639.738   100
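For anyone who wants to try the two base approaches side by side, here is a minimal sketch on a toy data frame. The data below is made up for illustration, and it assumes every level of a has the same number of rows; with unequal group sizes, neither call is guaranteed to give a clean data frame.

```r
# Toy data, assumed for illustration: three groups with two rows each
df <- data.frame(a = rep(c("x", "y", "z"), each = 2),
                 b = c(8, 6, 3, 4, 5, 6))

# One column per level of a, via the formula interface
wide_unstack <- unstack(df, b ~ a)

# Same idea: split b by a into a named list, then bind the pieces as columns
wide_split <- data.frame(split(df$b, df$a))

print(wide_unstack)
print(wide_split)
```

Both results have one column per level of a, with the values in their original within-group order.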
By popular demand, here is a benchmark on something bigger.
I haven't included the data.table solution, as I'm not sure whether pass-by-reference would be a problem for microbenchmark.
library(microbenchmark)
library(tidyr)
library(magrittr)
nlevels <- 3
# Ensure that all levels have the same number of elements
nrow <- 1e6 - 1e6 %% nlevels
df <- data.frame(a = sample(rep(c("x", "y", "z"), length.out = nrow)),
                 b = sample.int(9, nrow, replace = TRUE))
microbenchmark(
  df %>% spread(a, b),
  unstack(df, b ~ a),
  data.frame(split(df$b, df$a)),
  do.call(cbind, split(df$b, df$a))
)
Even on 1 million rows, unstack is faster than spread. Notably, the split solution is faster still.
Unit: milliseconds
                              expr       min        lq      mean    median       uq       max neval
               df %>% spread(a, b) 366.24426 414.46913 450.78504 453.75258 486.1113 542.03722   100
                unstack(df, b ~ a)  47.07663  51.17663  61.24411  53.05315  56.1114 102.71562   100
     data.frame(split(df$b, df$a))  19.44173  19.74379  22.28060  20.18726  22.1372  67.53844   100
 do.call(cbind, split(df$b, df$a))  26.99798  27.41594  31.27944  27.93225  31.2565  79.93624   100
Something like this?
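library(dplyr)  # for select()
# Add a row index so spread() has a unique key (assumes df is sorted by a, with equal group sizes)
df <- data.frame(ind = rep(1:min(table(df$a)), length(unique(df$a))), df)
df %>% spread(a, b) %>% select(-ind)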
Output of the spread() step, before ind is dropped:
  ind x y z
1   1 8 3 5
2   2 6 4 6
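In newer tidyr releases, spread() is superseded by pivot_wider(), so the same row-index idea could be sketched as below. This assumes tidyr 1.0.0 or later plus dplyr, and the toy data is made up for illustration; it is one possible way to write it, not the only one.

```r
library(dplyr)
library(tidyr)

# Toy data, assumed for illustration
df <- data.frame(a = rep(c("x", "y", "z"), each = 2),
                 b = c(8, 6, 3, 4, 5, 6))

wide <- df %>%
  group_by(a) %>%
  mutate(ind = row_number()) %>%   # position of each row within its group
  ungroup() %>%
  pivot_wider(names_from = a, values_from = b) %>%
  select(-ind)                     # drop the helper index

print(wide)
```

Because the index is computed per group, this version does not depend on the rows being sorted by a, and groups of unequal size are padded with NA rather than causing an error.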