How can I save files in parallel without automatically increasing the file size?
I haven't used ddply to parallelize saving objects, but my guess is that the files get much larger because when you save a model object, it also carries information about the environment it was fitted in.
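You can check what an lm fit drags along by looking at the environment attached to its terms. Below is a minimal sketch; strip_env is a hypothetical helper of mine (not from the original code) that re-points those environments at globalenv() so the enclosing data isn't serialized with the model:

f <- function(d) lm(Sepal.Length ~ Sepal.Width, data = d)
m <- f(iris)
environment(m$terms)  # f's evaluation frame, not R_GlobalEnv

strip_env <- function(model) {
  # re-point the environments captured by the terms, in both the fit
  # and its stored model frame, so save() doesn't serialize the caller's data
  environment(model$terms) <- globalenv()
  if (!is.null(model$model))
    attr(attr(model$model, "terms"), ".Environment") <- globalenv()
  model
}

m_small <- strip_env(m)
save(m,       compress = FALSE, file = "m_full.RData")
save(m_small, compress = FALSE, file = "m_small.RData")
sapply(c("m_full.RData", "m_small.RData"), file.size)  # m_small should be smaller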
So, using your ddply code above, the sizes I get are:
sapply(dir(pattern="RData"),file.size)
setosa.RData versicolor.RData virginica.RData
36002 36002 36002
There are two options. One is to use furrr / purrr:
library(purrr)
library(furrr)
library(future)

plan(multisession)  # without a registered plan, future_map() runs sequentially

func <- function(SpeciesData){
  Model <- lm(formula = "Sepal.Length~Sepal.Width+Petal.Length+Petal.Width",
              data = SpeciesData)
  save(Model,
       compress = FALSE,
       file = gsub(x = "Species.RData",
                   pattern = "Species",
                   replacement = unique(SpeciesData$Species)))
}

split(iris, iris$Species) %>% future_map(func)
sapply(dir(pattern="RData"),file.size)
setosa.RData versicolor.RData virginica.RData
25426 27156 27156
The other is to use saveRDS (with ddply), since you only have one object to save per group:
ddply(.data = iris,
      .variables = "Species",
      .parallel = TRUE, ## with parallel
      .fun = function(SpeciesData){
        Model <- lm(formula = "Sepal.Length~Sepal.Width+Petal.Length+Petal.Width",
                    data = SpeciesData)
        saveRDS(Model,
                gsub(x = "Species.rds",
                     pattern = "Species",
                     replacement = unique(SpeciesData$Species)))
      })
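Note that .parallel = TRUE only actually runs in parallel if a foreach backend is registered before the ddply call; the same doSNOW setup used in the last example works here too:

doSNOW::registerDoSNOW(cl <- snow::makeCluster(3))
# ... run the ddply call above ...
snow::stopCluster(cl)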
sapply(dir(pattern="rds"),file.size)
setosa.rds versicolor.rds virginica.rds
6389 6300 6277
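Part of the .rds vs .RData size gap here is simply compression: saveRDS() compresses by default, while the save() calls above passed compress = FALSE. A quick self-contained check (the file names are mine):

m <- lm(Sepal.Length ~ Sepal.Width, data = iris)
save(m, compress = FALSE, file = "m_raw.RData")
save(m, compress = TRUE,  file = "m_gz.RData")
saveRDS(m, "m.rds")  # compress = TRUE is the default for saveRDS()
sapply(c("m_raw.RData", "m_gz.RData", "m.rds"), file.size)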
You then use readRDS instead of load to read the model back:
m1 = readRDS("setosa.rds")
m1
Call:
lm(formula = "Sepal.Length~Sepal.Width+Petal.Length+Petal.Width",
data = SpeciesData)
Coefficients:
(Intercept) Sepal.Width Petal.Length Petal.Width
2.3519 0.6548 0.2376 0.2521
We can compare the coefficients with those from the .RData object:
m2 = get(load("setosa.RData"))
m2
Call:
lm(formula = "Sepal.Length~Sepal.Width+Petal.Length+Petal.Width",
data = SpeciesData)
Coefficients:
(Intercept) Sepal.Width Petal.Length Petal.Width
2.3519 0.6548 0.2376 0.2521
The objects are not identical because of the environments they carry, but for prediction, or whatever else we normally use a model for, they behave the same:
identical(predict(m1, iris[1:10, ]), predict(m2, iris[1:10, ]))
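To confirm where the difference lives, compare the fits directly (the outputs in the comments are what I'd expect, not captured from the original session):

all.equal(coef(m1), coef(m2))  # TRUE: the fitted coefficients match
identical(m1, m2)              # FALSE: the attached environments differ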
As others mentioned, a small amount of information about the environment may be getting saved into the files, something you probably wouldn't notice except that the files are so small.
If you're just interested in file size, try saving all the models into a single list and then saving that list to one file. ddply can only handle a data.frame as the result of the function, so we have to use dlply instead to tell it to store the results in a list. Doing this produced a single file of about 60 KB.
Here's an example of what I'm talking about:
library(plyr)
doSNOW::registerDoSNOW(cl <- snow::makeCluster(3))

models <- dlply(.data = iris,
                .variables = "Species",
                .parallel = TRUE, ## with parallel
                .fun = function(SpeciesData){
                  # Create a simple model per species
                  lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
                     data = SpeciesData)
                })

snow::stopCluster(cl)
save(models, compress = FALSE, file = "combined_models.RData")
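Reading the combined file back restores the whole list at once; the element names come from the Species levels (a usage sketch):

load("combined_models.RData")  # restores the object `models`
names(models)                  # "setosa" "versicolor" "virginica"
predict(models$setosa, iris[1:5, ])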