Why is the parallel package slower than just using apply?
Completely agree with @Arun's and @PaulHiemstra's arguments concerning the Why...? part of your question.
However, it seems that you can get some benefit from the parallel package in your situation (at least if you are not stuck with Windows). A possible solution is using mclapply instead of parSapply, which relies on fast forking and shared memory.
tm2 <- system.time({
  tm3 <- system.time({
    df7 <- matrix(unlist(mclapply(df2$var1, FUN = function(x) {x == df1$var1}, mc.cores = 8)), nrow = i)
    dimnames(df7) <- list(row.names(df1), row.names(df2))
  })
})
Of course, the nested system.time is not needed here. With my 2 cores, mclapply gave a clear speed-up over the serial version.
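Since the snippet above relies on df1, df2 and i from your question, here is a self-contained sketch of the same comparison on made-up data (the sizes and the core count are just illustrative assumptions). mclapply forks the running session, so it will not parallelise on Windows:

library(parallel)

# made-up stand-ins for the question's df1/df2
n1 <- 2000; n2 <- 2000
df1 <- data.frame(var1 = sample(letters, n1, replace = TRUE))
df2 <- data.frame(var1 = sample(letters, n2, replace = TRUE))

# serial baseline
tm_serial <- system.time(
  m1 <- sapply(df2$var1, function(x) x == df1$var1)
)

# forked workers sharing memory with the master (mc.cores must stay 1 on Windows)
tm_fork <- system.time(
  m2 <- matrix(unlist(mclapply(df2$var1, function(x) x == df1$var1, mc.cores = 2)),
               nrow = n1)
)

identical(unname(m1), m2)   # both approaches should give the same matrix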
These differences can be attributed to 1) communication overhead (especially if you run across nodes) and 2) performance overhead (for example, if your job is not that intensive compared to the cost of initiating the parallelisation). Usually, if the task you are parallelising is not that time-consuming, you will mostly find that parallelisation does NOT have much of an effect (the benefit becomes visible mostly on huge, compute-heavy datasets).
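To make points 1) and 2) concrete with the parallel package itself, here is a minimal sketch (the object names big and vals and all the sizes are made up); the per-element comparison is so cheap that shipping the 1e4-element results back from the workers typically costs far more than the parallel computation saves:

library(parallel)

big  <- rnorm(1e4)   # plays the role of the long vector each element is compared against
vals <- rnorm(200)   # plays the role of the elements being looped over

cl <- makeCluster(2)
clusterExport(cl, "big")   # the comparison vector has to be shipped to the workers

# everything stays in one process: ~2 million comparisons, done almost instantly
system.time(r1 <- sapply(vals, function(x) x > big))

# same comparisons, but every 1e4-element result is serialised back to the master,
# so communication dominates the elapsed time
system.time(r2 <- parSapply(cl, vals, function(x) x > big))

stopCluster(cl)
identical(r1, r2)   # same results either way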
Even though this may not directly answer your benchmarking question, I hope it is straightforward enough to relate to. As an example, here I construct a data.frame with 1e6 rows, 1e4 unique entries in column group, and some values in column val. Then I run ddply from plyr, once in parallel using doMC and once without parallelisation.
df <- data.frame(group = as.factor(sample(1:1e4, 1e6, replace = T)),
                 val = sample(1:10, 1e6, replace = T))
> head(df)
#   group val
# 1  8498   8
# 2  5253   6
# 3  1495   1
# 4  7362   9
# 5  2344   6
# 6  5602   9
> dim(df)
# [1] 1000000       2
require(plyr)
require(doMC)
registerDoMC(20) # 20 processors
# parallelisation using doMC + plyr
P.PLYR <- function() {
  o1 <- ddply(df, .(group), function(x) sum(x$val), .parallel = TRUE)
}
# no parallelisation
PLYR <- function() {
  o2 <- ddply(df, .(group), function(x) sum(x$val), .parallel = FALSE)
}
require(rbenchmark)
benchmark(P.PLYR(), PLYR(), replications = 2, order = "elapsed")
test replications elapsed relative user.self sys.self user.child sys.child
2 PLYR() 2 8.925 1.000 8.865 0.068 0.000 0.000
1 P.PLYR() 2 30.637 3.433 15.841 13.945 8.944 38.858
As you can see, the parallel version of plyr runs roughly 3.4 times slower.
Now, let me use the same data.frame, but instead of computing sum, construct a slightly more demanding function, say, median(.) * median(rnorm(1e4)) (meaningless, yes). You'll see that the tide begins to shift:
# parallelisation using doMC + plyr
P.PLYR <- function() {
  o1 <- ddply(df, .(group), function(x)
    median(x$val) * median(rnorm(1e4)), .parallel = TRUE)
}
# no parallelisation
PLYR <- function() {
  o2 <- ddply(df, .(group), function(x)
    median(x$val) * median(rnorm(1e4)), .parallel = FALSE)
}
> benchmark(P.PLYR(), PLYR(), replications = 2, order = "elapsed")
test replications elapsed relative user.self sys.self user.child sys.child
1 P.PLYR() 2 41.911 1.000 15.265 15.369 141.585 34.254
2 PLYR() 2 73.417 1.752 73.372 0.052 0.000 0.000
Here, the parallel version is 1.752 times
faster than the non-parallel version.
Edit: Following @Paul's comment, I introduced a small delay using Sys.sleep(). Of course the results are obvious, but just for the sake of completeness, here is the result on a 20 x 2 data.frame:
df <- data.frame(group=sample(letters[1:5], 20, replace=T), val=sample(20))
# parallelisation using doMC + plyr
P.PLYR <- function() {
  o1 <- ddply(df, .(group), function(x) {
    Sys.sleep(2)
    median(x$val)
  }, .parallel = TRUE)
}
# no parallelisation
PLYR <- function() {
  o2 <- ddply(df, .(group), function(x) {
    Sys.sleep(2)
    median(x$val)
  }, .parallel = FALSE)
}
> benchmark(P.PLYR(), PLYR(), replications = 2, order = "elapsed")
# test replications elapsed relative user.self sys.self user.child sys.child
# 1 P.PLYR() 2 4.116 1.000 0.056 0.056 0.024 0.04
# 2 PLYR() 2 20.050 4.871 0.028 0.000 0.000 0.00
The difference here is not surprising.
Running jobs in parallel incurs overhead. Only if the jobs you fire at the worker nodes take a significant amount of time does parallelization improve overall performance. When the individual jobs take only milliseconds, the overhead of constantly firing off jobs will degrade overall performance. The trick is to divide the work over the nodes in such a way that the jobs are sufficiently long, say at least a few seconds. I used this to great effect running six Fortran models simultaneously, and those individual model runs took hours, making the effect of the overhead all but negligible.
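A minimal sketch of that chunking idea (the helper slow_unit, the sizes and the number of workers are all made up for illustration): rather than firing one tiny job per element, hand each worker a single chunk that carries a substantial amount of work, so the fixed per-job overhead is paid only a handful of times:

library(parallel)

slow_unit <- function(v) { Sys.sleep(0.01); v^2 }   # stand-in for ~10 ms of real work per element

x  <- rnorm(400)
cl <- makeCluster(4)
clusterExport(cl, "slow_unit")

# one job per element: 400 separate jobs go back and forth to the workers
system.time(r1 <- clusterApply(cl, x, slow_unit))

# one job per worker: 4 jobs, each carrying ~1 s of work
chunks <- split(x, cut(seq_along(x), length(cl), labels = FALSE))
system.time(r2 <- parLapply(cl, chunks, function(ch) sapply(ch, slow_unit)))

stopCluster(cl)
all.equal(unlist(r1), unname(unlist(r2)))   # same results either way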
Note that I haven't run your example, but the situation I describe above is often the issue when parallelization takes longer than running sequentially.