Do stages in an application run in parallel in Spark?
How do stages execute in a Spark job?
Stages of a job can run in parallel if there are no dependencies among them.
In Spark, stages are split at boundaries. You have shuffle stages, where the lineage is cut at a wide transformation such as reduceByKey, and you have result stages, which yield a result without causing a shuffle, e.g. a map operation:
(Picture provided by Cloudera)
Since groupByKey causes a shuffle, you can see the split into the pink boxes, which marks a stage boundary.
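As a minimal sketch of this (assuming spark-shell, which provides `sc`, and a hypothetical input.txt), the lineage below contains exactly one shuffle, so Spark cuts it into two stages; toDebugString prints the lineage, with the indentation step marking the shuffle boundary:

```scala
// Paste into spark-shell (it provides `sc`); "input.txt" is hypothetical.
val counts = sc.textFile("input.txt")   // narrow: stays in the first stage
  .flatMap(_.split("\\s+"))             // narrow: same stage
  .map(word => (word, 1))               // narrow: same stage
  .reduceByKey(_ + _)                   // wide: the shuffle starts a new stage

// The indentation step in toDebugString marks the shuffle (stage)
// boundary between the map side and the reduce side.
println(counts.toDebugString)
```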
Internally, a stage is further divided into tasks, one per partition. E.g. in the picture above, the first row, which does textFile -> map -> filter, runs as one task per partition, with each task applying all three transformations to its own partition.
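A short illustration of the task/partition relationship, reusing the `sc` from the sketch above and a made-up input file:

```scala
// One task per partition: ask for 4 partitions and the whole
// textFile -> map -> filter pipeline runs as 4 tasks, each task
// applying every transformation to its own partition.
val lines   = sc.textFile("input.txt", minPartitions = 4) // hypothetical file
val cleaned = lines.map(_.trim).filter(_.nonEmpty)        // narrow ops, same stage

println(cleaned.getNumPartitions) // typically 4 here, so the stage launches 4 tasks
```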
When one transformation's output is another transformation's input, the two must execute serially. But if stages are unrelated, e.g. the hadoopFile -> groupByKey -> map branch versus the textFile -> map -> filter branch in the picture, they can run in parallel (see the sketch below). Once a dependency between them is declared, for instance at a join, execution from that stage on continues serially.
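Here is a sketch of two unrelated lineages, loosely mirroring the picture; the file names and key extraction are made up. The two branches share no dependency, so the scheduler is free to run their stages in parallel until join ties them together:

```scala
// Two unrelated lineages; keys are the text before the first comma.
val branchA = sc.textFile("a.txt")              // stage: textFile -> map -> filter
  .map(line => (line.takeWhile(_ != ','), line))
  .filter { case (key, _) => key.nonEmpty }

val branchB = sc.textFile("b.txt")              // stage ends at groupByKey's shuffle
  .map(line => (line.takeWhile(_ != ','), 1))
  .groupByKey()
  .map { case (key, ones) => (key, ones.sum) }

// Up to this point the branches have no dependency on each other, so
// their stages can be scheduled in parallel. join introduces the
// dependency; everything after it executes serially.
val joined = branchA.join(branchB)
joined.count()
```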
Check the entities (stages, partitions) in this picture:
(picture credits)
Do stages in a job (Spark application?) run in parallel in Spark?
Yes, they can be executed in parallel if there is no sequential dependency.
Here, the partitions of Stage 1 and Stage 2 can be executed in parallel, but not those of Stage 0: because of the dependency, the partitions in Stages 1 and 2 have to be processed first.
Is there any ordering in the execution of stages that can be defined by the programmer, or will it be derived by the Spark engine?
A stage boundary is defined by where data shuffling happens among partitions (check the pink lines in the picture), so the stages themselves are derived by the Spark engine from the transformations you write rather than declared by the programmer.
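One way to see that the engine, not the programmer, draws the boundary (a small sketch, again assuming the `sc` from earlier): an RDD produced by a wide transformation carries a ShuffleDependency, and that dependency is exactly where the DAG scheduler cuts a stage:

```scala
import org.apache.spark.ShuffleDependency

// A wide transformation leaves a ShuffleDependency on the resulting RDD;
// the DAG scheduler cuts a stage at every such dependency.
val pairs   = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val reduced = pairs.reduceByKey(_ + _)

val isBoundary = reduced.dependencies.exists(_.isInstanceOf[ShuffleDependency[_, _, _]])
println(isBoundary) // true: the boundary was derived by Spark, not declared by us
```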