Effective parallel processing of large items
Serializing the data to a ByteArray object seems to overcome the data transfer bottleneck. The necessary functions, BinarySerialize and BinaryDeserialize, were introduced in version 11.1.
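As a quick sanity check of the round trip, here is a minimal sketch. The names `original`, `bytes` and `roundTrip` are made up for illustration, and the random matrix is only a stand-in for one item of the real data.

```wolfram
(* Hypothetical sample data; a stand-in for one item of the real data set *)
original = RandomReal[1, {1000, 2}];

(* BinarySerialize turns the expression into a ByteArray *)
bytes = BinarySerialize[original];
Head[bytes]   (* ByteArray *)

(* BinaryDeserialize reconstructs the expression from the ByteArray *)
roundTrip = BinaryDeserialize[bytes];

(* Compare the reconstructed expression with the input *)
roundTrip === original
```

If the last line gives True, nothing was lost in the round trip, so the subkernels see the same data as the main kernel.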
Here is a simple function implementing a ParallelMap variant that serializes the data before transferring it to the subkernels and has the subkernels deserialize it before processing:
ParallelMapSerialized[f_, data_, opts___] := ParallelMap[
f[BinaryDeserialize@#] &,
BinarySerialize /@ data,
opts
]
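Before timing anything, it may be worth checking on a toy input that the wrapper returns the same result as a plain Map. The sketch below repeats the definition so that it is self-contained; `small` is a hypothetical test list, not part of the benchmark.

```wolfram
(* Definition repeated from above so the snippet runs on its own *)
ParallelMapSerialized[f_, data_, opts___] := ParallelMap[
  f[BinaryDeserialize@#] &,
  BinarySerialize /@ data,
  opts
]

(* Hypothetical toy input: {{1}, {1, 2}, {1, 2, 3}, {1, 2, 3, 4}} *)
small = Table[Range[i], {i, 4}];

(* Each item is serialized, sent to a subkernel, deserialized and summed there *)
result = ParallelMapSerialized[Total, small]
(* {1, 3, 6, 10} *)

result === Map[Total, small]
(* True *)
```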
Running the benchmark again:
map = Map[
FindCurvePath[#[[1 ;; difficulty]]] &,
randomValues
]; // AbsoluteTiming
(* {9.60715, Null} *)
pmap = ParallelMap[
FindCurvePath[#[[1 ;; difficulty]]] &,
randomValues,
Method -> "ItemsPerEvaluation" -> 10
]; // AbsoluteTiming
(* {17.5937, Null} *)
pmapserialized = ParallelMapSerialized[
FindCurvePath[#[[1 ;; difficulty]]] &,
randomValues,
Method -> "ItemsPerEvaluation" -> 10
]; // AbsoluteTiming
(* {1.85387, Null} *)
map === pmap === pmapserialized
(* True *)
Serialization gave an almost 10-fold speedup over plain ParallelMap (17.6 s down to 1.85 s) and a roughly 5-fold speedup over serial Map (9.6 s down to 1.85 s).
Sometimes it helps to copy the shared variable into a local variable inside the mapped function first.
pmap = ParallelMap[
FindCurvePath[#[[1 ;; difficulty]]] &,
randomValues
]; // AbsoluteTiming
(* {3.51073, Null} *)
index = Range[Length[randomValues]];
pmap3 = ParallelMap[
Module[{r = randomValues[[#]]}, FindCurvePath[r[[1 ;; difficulty]]]] &,
index
]; // AbsoluteTiming
(* {1.13677, Null} *)
In this case it is enough to index into randomValues inside the mapped function, rather than passing the data itself as the list that ParallelMap iterates over:
index = Range[Length[randomValues]];
pmap4 = ParallelMap[
FindCurvePath[randomValues[[#, 1 ;; difficulty]]] &,
index]; // AbsoluteTiming
(* {1.13, Null} *)
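My understanding of why this helps, which is an assumption and not something the timings prove: the symbol used inside the mapped function is made known to the subkernels as a definition, and only the small integer indices are sent per item. The sketch below makes that step explicit with DistributeDefinitions. The names `data`, `n` and `byIndex` are hypothetical stand-ins for randomValues, difficulty and the result.

```wolfram
(* Hypothetical stand-in data, much smaller than the benchmark input *)
data = Table[RandomInteger[100, {50, 2}], 8];
n = 10;   (* plays the role of difficulty *)

(* Make data and n known on all subkernels up front. ParallelMap normally
   distributes such definitions automatically; doing it by hand is only
   meant to make the step visible. *)
DistributeDefinitions[data, n];

(* Only the indices 1..8 are passed as the list that ParallelMap iterates over *)
byIndex = ParallelMap[Total[data[[#, 1 ;; n]]] &, Range[Length[data]]];

(* Should agree with the serial computation on the data itself *)
byIndex === Map[Total[#[[1 ;; n]]] &, data]
```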