R: how to rbind two huge data-frames without running out of memory
Take a look at data.table, an R package for efficient operations on objects with several million records or more. Version 1.8.2 of that package offers the rbindlist function, which does what you want very efficiently. Instead of rbind(a5r, a6r) you can write:
library(data.table)
rbindlist(list(a5r, a6r))
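To make the call above concrete, here is a minimal sketch on toy data. The two small data frames are hypothetical stand-ins for the real a5r and a6r, and the rm/gc step is just one possible way to release the inputs once they are combined, assuming nothing else still needs them.

```r
library(data.table)

# Hypothetical stand-ins for the two large data frames from the question
a5r <- data.frame(id = 1:3, value = c(0.1, 0.2, 0.3))
a6r <- data.frame(id = 4:6, value = c(0.4, 0.5, 0.6))

# rbindlist() takes a list of data frames / data tables and stacks them
combined <- rbindlist(list(a5r, a6r))

# The result is a data.table, which also inherits from data.frame
print(class(combined))
print(nrow(combined))

# Drop the inputs once they are no longer needed so R can reclaim the memory
rm(a5r, a6r)
invisible(gc())

# Convert back only if later code needs a plain data.frame (this makes a copy)
combined_df <- as.data.frame(combined)
```

Since a data.table is also a data.frame, most downstream code should work on combined directly, so the final conversion can often be skipped.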
Rather than reading them into R at the beginning and then combining them, you could have SQLite read the files and combine them before sending the result to R. That way the files are never individually loaded into R.
# create two sample files
DF1 <- data.frame(A = 1:2, B = 2:3)
write.table(DF1, "data1.dat", sep = ",", quote = FALSE)
rm(DF1)
DF2 <- data.frame(A = 10:11, B = 12:13)
write.table(DF2, "data2.dat", sep = ",", quote = FALSE)
rm(DF2)
# now we do the real work
library(sqldf)
data1 <- file("data1.dat")
data2 <- file("data2.dat")
sqldf(c("select * from data1",
"insert into data1 select * from data2",
"select * from data1"),
dbname = tempfile())
This gives:
> sqldf(c("select * from data1", "insert into data1 select * from data2", "select * from data1"), dbname = tempfile())
A B
1 1 2
2 2 3
3 10 12
4 11 13
This shorter version also works if row order is unimportant (note that union also drops duplicate rows; use union all instead if you need to keep them):
sqldf("select * from data1 union select * from data2", dbname = tempfile())
See the sqldf home page http://sqldf.googlecode.com and ?sqldf for more info. Pay particular attention to the file format arguments, since they are close but not identical to read.table. Here we have used the defaults, so it was less of an issue.
Try creating a data.frame of the desired size up front, then fill it with your data using subscripts.
dtf <- as.data.frame(matrix(NA, 10, 10))
dtf1 <- as.data.frame(matrix(1:50, 5, 10, byrow=TRUE))
dtf2 <- as.data.frame(matrix(51:100, 5, 10, byrow=TRUE))
dtf[1:5, ] <- dtf1
dtf[6:10, ] <- dtf2
I guess that rbind grows the object without pre-allocating its dimensions... I'm not sure, this is only a guess. I'll comb through "The R Inferno" or "Data Manipulation with R" tonight. Maybe merge will do the trick...
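The pre-allocation idea above can be wrapped in a small helper. This is only a sketch: fill_preallocated is a hypothetical name, it assumes every piece has the same columns in the same order with compatible types, and I have not measured whether it really uses less memory than rbind on large inputs.

```r
# Hypothetical helper: pre-allocate a data frame, then fill it block by block.
# Assumes all pieces share the column layout of the first piece.
fill_preallocated <- function(pieces) {
  n_rows <- vapply(pieces, nrow, integer(1))
  total  <- sum(n_rows)

  # Indexing with NA row numbers yields `total` all-NA rows that keep
  # the column names and types of the first piece
  out <- pieces[[1]][rep(NA_integer_, total), , drop = FALSE]
  rownames(out) <- NULL

  # Row range occupied by each piece in the result
  end   <- cumsum(n_rows)
  start <- end - n_rows + 1L

  for (i in seq_along(pieces)) {
    if (n_rows[i] == 0L) next  # nothing to copy for an empty piece
    out[start[i]:end[i], ] <- pieces[[i]]
  }
  out
}

# Same toy data as in the example above
dtf1 <- as.data.frame(matrix(1:50, 5, 10, byrow = TRUE))
dtf2 <- as.data.frame(matrix(51:100, 5, 10, byrow = TRUE))
dtf  <- fill_preallocated(list(dtf1, dtf2))
print(dim(dtf))
```

Taking the template from the first piece, rather than from matrix(NA, ...), keeps the column types from starting out as logical.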
EDIT
Also bear in mind that your system and/or R may simply not be able to cope with something that big. Try RevolutionR; maybe you'll manage to save some time/resources.