Apache Spark: Difference between parallelize and broadcast

sc.parallelize(...) spreads the data amongst all executors

sc.broadcast(...) copies the data to the JVM of each executor


An RDD in Spark is just a collection split into partitions (at least one). Each partition lives on an executor, which processes it. With sc.parallelize(), your collection is split into partitions assigned to executors, so for example you could have [1,2] on one executor, [3] on another, and [4,5] on another one. In this way executors process the partitions in parallel. With broadcast, as GwydionFR said, the passed value is copied once to each executor, where every task can then read it locally.