Effective parallel processing of large items

Serializing the data to a ByteArray object seems to overcome the data transfer bottleneck. The necessary functions, BinarySerialize and BinaryDeserialize, were introduced in version 11.1.
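
As a minimal sketch of the round trip: BinarySerialize turns an expression into a ByteArray, and BinaryDeserialize reconstructs the expression from it. The variable names and the random test data are assumptions for illustration, not part of the benchmark.

```wolfram
(* hypothetical test data: 1000 random 2D points *)
points = RandomReal[1, {1000, 2}];

(* serialize the expression to a binary representation *)
bytes = BinarySerialize[points];
ByteArrayQ[bytes]

(* deserializing should give back the original expression *)
BinaryDeserialize[bytes] === points
```

Both checks should evaluate to True.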

Here is a simple function implementing a ParallelMap that serializes the data before transferring it to the subkernels and has the subkernels deserialize it before processing:

ParallelMapSerialized[f_, data_, opts___] := ParallelMap[
  f[BinaryDeserialize@#] &,
  BinarySerialize /@ data,
  opts
  ]

Running the benchmark again:

map = Map[
    FindCurvePath[#[[1 ;; difficulty]]] &,
    randomValues
    ]; // AbsoluteTiming

(* {9.60715, Null} *)

pmap = ParallelMap[
    FindCurvePath[#[[1 ;; difficulty]]] &,
    randomValues,
    Method -> "ItemsPerEvaluation" -> 10
    ]; // AbsoluteTiming

(* {17.5937, Null} *)

pmapserialized = ParallelMapSerialized[
    FindCurvePath[#[[1 ;; difficulty]]] &,
    randomValues,
    Method -> "ItemsPerEvaluation" -> 10
    ]; // AbsoluteTiming

(* {1.85387, Null} *)

pmap === pmap2 === pmapserialized
(* True *)

Serialization made the computation almost 10 times faster than plain ParallelMap and about 5 times faster than serial processing.

Sometimes it helps to make the shared variable local first.

In this case it is enough to map over the indices and extract each item inside the mapped function, rather than passing the data itself as the list that ParallelMap iterates over:

index = Range[Length[randomValues]];
pmap4 = ParallelMap[
    FindCurvePath[randomValues[[#, 1 ;; difficulty]]] &, 
    index]; // AbsoluteTiming
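
One way to read this, sketched under stated assumptions: each subkernel holds its own copy of randomValues, and the mapped function refers to the data by name, so only integer indices are passed as items. ParallelMap normally distributes the needed definitions automatically; the explicit DistributeDefinitions call below only makes that step visible. The stand-in data, the value of difficulty, and the cheap Dimensions call used in place of FindCurvePath are assumptions for illustration.

```wolfram
(* hypothetical stand-in data: 100 items of 50 2D points each *)
randomValues = RandomReal[1, {100, 50, 2}];
difficulty = 20;

(* copy the data and the parameter to every subkernel once *)
DistributeDefinitions[randomValues, difficulty];

(* map over integer indices; each subkernel looks up its own copy *)
index = Range[Length[randomValues]];
dims = ParallelMap[
    Dimensions[randomValues[[#, 1 ;; difficulty]]] &,
    index];

(* compare with the serial map over the data itself *)
dims === Map[Dimensions[#[[1 ;; difficulty]]] &, randomValues]
```

The final comparison should evaluate to True, since both versions compute the same thing for every item.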