Can rbind be parallelized in R?
Since you said that you want to rbind data.frame objects, you should use the data.table package. It has a function called rbindlist that is a drastically faster version of rbind. I am not 100% sure, but I would bet that any use of rbind triggers a copy, while rbindlist does not. In any case, a data.table is a data.frame, so you lose nothing by trying it.
EDIT:
library(data.table)
system.time(dt <- rbindlist(pieces))
   user  system elapsed 
   0.12    0.00    0.13 
tables()
NAME NROW MB COLS KEY
[1,] dt 1,000 8 X1,X2,X3,X4,X5,X6,X7,X8,...
Total: 8MB
Lightning fast...
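For completeness, here is a self-contained sketch of the same comparison. The pieces object above comes from the question and isn't shown here, so this version builds an illustrative list of 1,000 single-row data frames (absolute timings will therefore differ from the benchmark above):
library(data.table)

# illustrative stand-in for `pieces` (not the questioner's actual data)
pieces <- lapply(seq_len(1000), function(i) {
  setNames(as.data.frame(as.list(rnorm(16))), paste0("X", 1:16))
})

system.time(df <- do.call(rbind, pieces))  # base rbind.data.frame
system.time(dt <- rbindlist(pieces))       # data.table::rbindlist

dim(df); dim(dt)  # same 1000 x 16 result; a data.table is also a data.frame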
I haven't found a way to do this in parallel either thus far. However, for my dataset (a list of about 1500 data frames totaling 4.5M rows) the following snippet seemed to help:
while (length(lst) > 1) {
  # indices of the first element of each pair
  idxlst <- seq(from = 1, to = length(lst), by = 2)
  lst <- lapply(idxlst, function(i) {
    # odd list length: the last element has no partner, so pass it through
    if (i == length(lst)) return(lst[[i]])
    rbind(lst[[i]], lst[[i + 1]])
  })
}
where lst is the list of data frames. It was about 4 times faster than do.call(rbind, lst) or even do.call(rbind.fill, lst) (with rbind.fill from the plyr package). Each iteration halves the number of data frames.
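Since each pair within a round is combined independently of the others, the lapply() step is at least in principle a candidate for parallelisation. Below is a hedged, untested sketch that swaps in parallel::mclapply (fork-based, so Unix-alikes only); parallel_rbind is a made-up wrapper name, and whether it actually wins depends on whether the rbind work outweighs the cost of shipping the merged pieces back from the workers:
library(parallel)

# Sketch only: same pairwise-halving idea, with each round's pair merges
# farmed out to forked workers.
parallel_rbind <- function(lst, cores = parallel::detectCores()) {
  while (length(lst) > 1) {
    idxlst <- seq(from = 1, to = length(lst), by = 2)
    lst <- mclapply(idxlst, function(i) {
      if (i == length(lst)) return(lst[[i]])  # odd leftover piece passes through
      rbind(lst[[i]], lst[[i + 1]])
    }, mc.cores = cores)
  }
  lst[[1]]
}

# usage: combined <- parallel_rbind(lst)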