How can I save files in parallel without automatically increasing the file size?
I haven't used ddply to parallelize saving objects, but my guess is that the files get much larger because when you save a model object, it also carries information about the environment it was fitted in.
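You can check what an lm fit drags along by looking at the environment attached to its terms. Below is a minimal sketch; strip_env is a hypothetical helper of mine (not from the original code) that re-points those environments at globalenv() so the enclosing data isn't serialized with the model:

f <- function(d) lm(Sepal.Length ~ Sepal.Width, data = d)
m <- f(iris)
environment(m$terms)  # f's evaluation frame, not R_GlobalEnv

strip_env <- function(model) {
  # re-point the environments captured by the terms, in both the fit
  # and its stored model frame, so save() doesn't serialize the caller's data
  environment(model$terms) <- globalenv()
  if (!is.null(model$model))
    attr(attr(model$model, "terms"), ".Environment") <- globalenv()
  model
}

m_small <- strip_env(m)
save(m,       compress = FALSE, file = "m_full.RData")
save(m_small, compress = FALSE, file = "m_small.RData")
sapply(c("m_full.RData", "m_small.RData"), file.size)  # m_small should be smaller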
So, using your ddply code above, the sizes I get are:
sapply(dir(pattern="RData"),file.size)
setosa.RData versicolor.RData virginica.RData
36002 36002 36002
There are two options. One is to use furrr / purrr:
library(purrr)
library(furrr)
library(future)

plan(multisession)  # without a registered plan, future_map() runs sequentially

func <- function(SpeciesData){
  Model <- lm(formula = "Sepal.Length~Sepal.Width+Petal.Length+Petal.Width",
              data = SpeciesData)
  save(Model,
       compress = FALSE,
       file = gsub(x = "Species.RData",
                   pattern = "Species",
                   replacement = unique(SpeciesData$Species)))
}

split(iris, iris$Species) %>% future_map(func)
sapply(dir(pattern="RData"),file.size)
setosa.RData versicolor.RData virginica.RData
25426 27156 27156
The other is to use saveRDS (with ddply), since you only have one object to save per group:
ddply(.data = iris,
      .variables = "Species",
      .parallel = TRUE, ## with parallel
      .fun = function(SpeciesData){
        Model <- lm(formula = "Sepal.Length~Sepal.Width+Petal.Length+Petal.Width",
                    data = SpeciesData)
        saveRDS(Model,
                gsub(x = "Species.rds",
                     pattern = "Species",
                     replacement = unique(SpeciesData$Species)))
      })
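Note that .parallel = TRUE only actually runs in parallel if a foreach backend is registered before the ddply call; the same doSNOW setup used in the last example works here too:

doSNOW::registerDoSNOW(cl <- snow::makeCluster(3))
# ... run the ddply call above ...
snow::stopCluster(cl)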
sapply(dir(pattern="rds"),file.size)
setosa.rds versicolor.rds virginica.rds
6389 6300 6277
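Part of the .rds vs .RData size gap here is simply compression: saveRDS() compresses by default, while the save() calls above passed compress = FALSE. A quick self-contained check (the file names are mine):

m <- lm(Sepal.Length ~ Sepal.Width, data = iris)
save(m, compress = FALSE, file = "m_raw.RData")
save(m, compress = TRUE,  file = "m_gz.RData")
saveRDS(m, "m.rds")  # compress = TRUE is the default for saveRDS()
sapply(c("m_raw.RData", "m_gz.RData", "m.rds"), file.size)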
You then use readRDS instead of load to read the model back:
m1 = readRDS("setosa.rds")
m1
Call:
lm(formula = "Sepal.Length~Sepal.Width+Petal.Length+Petal.Width",
data = SpeciesData)
Coefficients:
(Intercept) Sepal.Width Petal.Length Petal.Width
2.3519 0.6548 0.2376 0.2521
We can compare the coefficients with those from the .RData object:
m2 = get(load("setosa.RData"))
m2
Call:
lm(formula = "Sepal.Length~Sepal.Width+Petal.Length+Petal.Width",
data = SpeciesData)
Coefficients:
(Intercept) Sepal.Width Petal.Length Petal.Width
2.3519 0.6548 0.2376 0.2521
The objects are not identical because of the environments they carry, but for prediction, or whatever else we normally use a model for, they behave the same:
identical(predict(m1, iris[1:10, ]), predict(m2, iris[1:10, ]))
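To confirm where the difference lives, compare the fits directly (the outputs in the comments are what I'd expect, not captured from the original session):

all.equal(coef(m1), coef(m2))  # TRUE: the fitted coefficients match
identical(m1, m2)              # FALSE: the attached environments differ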
As others mentioned, a small amount of information about the environment may be getting saved into the files, something you probably wouldn't notice except that the files are so small.
If you're just interested in file size, try saving all the models into a single list and then saving that list to one file. ddply can only handle a data.frame as the result of the function, so we have to use dlply instead to tell it to store the results in a list. Doing this produced a single file of about 60 KB.
Here's an example of what I'm talking about:
library(plyr)
doSNOW::registerDoSNOW(cl <- snow::makeCluster(3))

models <- dlply(.data = iris,
                .variables = "Species",
                .parallel = TRUE, ## with parallel
                .fun = function(SpeciesData){
                  # Create a simple model per species
                  lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
                     data = SpeciesData)
                })

snow::stopCluster(cl)
save(models, compress = FALSE, file = "combined_models.RData")
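Reading the combined file back restores the whole list at once; the element names come from the Species levels (a usage sketch):

load("combined_models.RData")  # restores the object `models`
names(models)                  # "setosa" "versicolor" "virginica"
predict(models$setosa, iris[1:5, ])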