Using parLapply and clusterExport inside a function

Another solution is to pass the additional variables as arguments to your function; parLapply forwards them to the workers too. If 'text.var' is the big data, then it pays to apply over text.var itself rather than over an index, because then each worker receives only the portion of text.var it needs, rather than the whole object.

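par.test <- function(text.var, gc.rate=10){ 
    require(parallel)
    # pos: number of characters in each word, joined with " | "
    pos <-  function(i) {
        paste(sapply(strsplit(tolower(i), " "), nchar), collapse=" | ")
    }
    cl <- makeCluster(mc <- getOption("cl.cores", 4))
    on.exit(stopCluster(cl))  # shut the workers down when the function exits
    # arguments after the function (gc.rate, pos) are forwarded to every call
    parLapply(cl, text.var, function(text.vari, gc.rate, pos) {
        # only the element is visible here, not its index, so there is
        # nothing to test against gc.rate; no explicit gc() call is made
        pos(text.vari)
    }, gc.rate, pos)
}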

This is also conceptually pleasing. (It's rarely necessary to explicitly invoke the garbage collector).
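
The worker function in par.test receives each element of text.var but not its position, so a test such as i %% gc.rate has nothing to work with there. If the periodic gc() call is really wanted, one option is to send the index along with each element. The following is only a sketch: it swaps in clusterMap for parLapply, and the names par.test.indexed and n.cores are made up for illustration.

```r
library(parallel)

# Hypothetical variant of par.test: clusterMap (not parLapply) pairs each
# element of text.var with its index i.
par.test.indexed <- function(text.var, gc.rate = 10, n.cores = 2) {
    pos <- function(i) {
        paste(sapply(strsplit(tolower(i), " "), nchar), collapse = " | ")
    }
    cl <- makeCluster(n.cores)
    on.exit(stopCluster(cl))  # shut the workers down on exit
    clusterMap(cl, function(text.vari, i, gc.rate, pos) {
        x <- pos(text.vari)
        if (i %% gc.rate == 0) gc()  # i is now a real argument
        x
    }, text.var, seq_along(text.var),
    MoreArgs = list(gc.rate = gc.rate, pos = pos))  # constant extra arguments
}

res <- par.test.indexed(c("The quick brown fox", "jumps over"), gc.rate = 1)
```

The vectors passed after the function are iterated over in parallel, while everything in MoreArgs is handed unchanged to every call.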

Memory management causes additional problems when a script is run with source(), because the code is then evaluated several frames deep rather than at top level. Compare

> stop("oops")
Error: oops
> traceback()
1: stop("oops")

with the same call in a script

> source("foo.R")
Error in eval(ei, envir) : oops
> traceback()
5: stop("oops") at foo.R#1
4: eval(ei, envir)
3: eval(ei, envir)
2: withVisible(eval(ei, envir))
1: source("foo.R")

Remember that R's serialize() function, used internally by parLapply() to move data to workers, serializes a function together with its enclosing environments, up to but not including the .GlobalEnv. So data objects created in the script are serialized to the worker, whereas if run interactively they would not be serialized. This may account for @SeldeomSeenSlim's problems when running a script. Probably the solution is to separate 'data' from 'algorithm' more clearly, e.g., by storing objects in the file system, a database, or similar.
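
To see the effect of the enclosing environment on what gets serialized, here is a small sketch that needs no cluster at all. The names make.worker, big, f and f.global are hypothetical; the only assumption is that length(serialize(x, NULL)) is a fair stand-in for the number of bytes that would travel to a worker.

```r
# A function created inside another function keeps that function's frame
# as its enclosing environment, including objects it never uses.
make.worker <- function() {
    big <- numeric(1e6)   # about 8 MB of data the worker function ignores
    function(x) x + 1
}

f <- make.worker()
size.closure <- length(serialize(f, NULL))   # includes 'big'

# Re-homing the function in the global environment drops the extra baggage,
# because the .GlobalEnv itself is not serialized.
f.global <- f
environment(f.global) <- globalenv()
size.global <- length(serialize(f.global, NULL))

cat("with enclosing frame:", size.closure, "bytes\n")
cat("enclosed by .GlobalEnv:", size.global, "bytes\n")
```

The first size should be far larger than the second, even though both functions contain the same code.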


By default clusterExport looks in the .GlobalEnv for objects to export that are named in varlist. If your objects are not in the .GlobalEnv, you must tell clusterExport in which environment it can find those objects.

You can change your clusterExport call to the following (I didn't test it, but you said in the comments that it works):

clusterExport(cl=cl, varlist=c("text.var", "ntv", "gc.rate", "pos"), envir=environment())

This way, it will look in the function's environment for the objects to export.
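
A minimal, self-contained sketch of that pattern, in case it helps. The wrapper name export.test and the n.cores argument are made up for illustration, and only text.var and pos are exported to keep it short.

```r
library(parallel)

# Hypothetical wrapper showing clusterExport called from inside a function.
export.test <- function(text.var, n.cores = 2) {
    pos <- function(i) {
        paste(sapply(strsplit(tolower(i), " "), nchar), collapse = " | ")
    }
    cl <- makeCluster(n.cores)
    on.exit(stopCluster(cl))  # shut the workers down on exit
    # text.var and pos live in this call's frame, not in the .GlobalEnv,
    # so clusterExport has to be pointed at the current environment.
    clusterExport(cl, varlist = c("text.var", "pos"), envir = environment())
    # Evaluated in each worker's global environment, where the exported
    # copies now live.
    clusterEvalQ(cl, pos(text.var[1]))
}

res <- export.test(c("The quick brown fox", "jumps over"))
```

Without the envir argument, the clusterExport call would look for text.var and pos in the .GlobalEnv and fail if they are not defined there.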