Apache Spark: Difference between parallelize and broadcast
sc.parallelize(...)
spreads the data amongst all executors, split into partitions
sc.broadcast(...)
copies the data to the JVM of each executor
An RDD in Spark is just a collection split into partitions (at least one). Each partition lives on an executor, which processes it. With sc.parallelize(), your collection is split into partitions assigned to the executors, so you could have, for example, [1,2] on one executor, [3] on another, and [4,5] on a third; the executors then process their partitions in parallel. With broadcast, as GwydionFR said, the passed value is copied as a whole to each executor.
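A minimal sketch showing both, assuming a spark-shell session where sc is the predefined SparkContext (the numSlices value and the example data are arbitrary choices for illustration):

```scala
// parallelize: the collection is split into partitions spread over the executors.
// numSlices (3 here, an arbitrary choice) controls how many partitions are created.
val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5), numSlices = 3)

// glom() gathers each partition into an array, making the split visible,
// e.g. [1] [2,3] [4,5]
rdd.glom().collect().foreach(p => println(p.mkString("[", ",", "]")))

// broadcast: the whole map is copied once into each executor's JVM.
val lookup = sc.broadcast(Map(1 -> "a", 2 -> "b", 3 -> "c", 4 -> "d", 5 -> "e"))

// Every task, on every executor, reads the same local copy via .value.
rdd.map(n => lookup.value(n)).collect()   // Array(a, b, c, d, e)
```

Note that glom() is only used here to expose the partitioning; in a real job you would typically broadcast a large read-only lookup table so it is shipped once per executor instead of being serialized into every task's closure.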