Spark Streaming Processing Time vs Total Delay vs Processing Delay
Let's break down each metric. For that, let's define a basic streaming application which reads a batch from some arbitrary source every 4 seconds and computes the classic word count:
```scala
inputDStream.flatMap(line => line.split(" "))
            .map(word => (word, 1))
            .reduceByKey(_ + _)
            .saveAsTextFiles("hdfs://...") // DStream's sink is saveAsTextFiles
```
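For context, here is a sketch of how this pipeline might sit inside a complete application. The socket source, host, and port are illustrative assumptions, not part of the original example:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WordCountApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[2]")
    // The batch interval from the example: one batch every 4 seconds.
    val ssc = new StreamingContext(conf, Seconds(4))

    // Hypothetical source; any receiver-based or direct source works here.
    val inputDStream = ssc.socketTextStream("localhost", 9999)

    inputDStream.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .saveAsTextFiles("hdfs://...") // DStreams write one directory per batch

    ssc.start()
    ssc.awaitTermination()
  }
}
```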
- **Processing Time**: The time it takes to compute a given batch for all its jobs, end to end. In our case this means a single job which starts at `flatMap` and ends at `saveAsTextFiles`, and assumes as a prerequisite that the job has been submitted.
- **Scheduling Delay**: The time the Spark Streaming scheduler takes to submit the jobs of a batch. How is this computed? As we've said, our application reads a batch from the source every 4 seconds. Now let's assume that a given batch took 8 seconds to compute. This means the next batch's jobs have to wait `8 - 4 = 4` seconds before they can be submitted, making the scheduling delay 4 seconds long.
- **Total Delay**: This is `Scheduling Delay + Processing Time`. Following the same example: we're 4 seconds behind, so our scheduling delay is 4 seconds, and if the next batch takes another 8 seconds to compute, the total delay is `8 + 4 = 12` seconds long.
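To make the arithmetic concrete, here's a small plain-Scala sketch (no Spark involved; the simulation and its names are my own) that replays the example. It assumes batches arrive every 4 seconds and that a batch's jobs can only be submitted once the previous batch's jobs have finished:

```scala
// Models how scheduling delay and total delay accumulate for a
// fixed 4-second batch interval when batches take too long.
object DelaySimulation {
  val batchInterval = 4.0 // seconds, as in the example above

  // For each batch's processing time, returns (schedulingDelay, totalDelay).
  def delays(processingTimes: Seq[Double]): Seq[(Double, Double)] = {
    var prevFinish = 0.0 // absolute time at which the previous batch finished
    processingTimes.zipWithIndex.map { case (proc, i) =>
      val arrival = i * batchInterval                   // when the batch's data is ready
      val sched   = math.max(0.0, prevFinish - arrival) // wait for the queue to clear
      prevFinish  = arrival + sched + proc
      (sched, sched + proc)                             // total = scheduling + processing
    }
  }

  def main(args: Array[String]): Unit =
    // Two consecutive 8-second batches on a 4-second interval: the first
    // has no scheduling delay, the second is 8 - 4 = 4 seconds behind.
    println(delays(Seq(8.0, 8.0))) // List((0.0,8.0), (4.0,12.0))
}
```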
A live example from a working Streaming application:
We see that:
- The bottom job took 11 seconds to process, so the next batch's scheduling delay is `11 - 4 = 7` seconds.
- If we look at the second row from the bottom, we see that scheduling delay + processing time = total delay; in that case (rounding 0.9 up to 1), `7 + 1 = 8` seconds.
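The same arithmetic, applied to the numbers from the UI above (plain Scala, illustrative only):

```scala
// Re-checking the live example's numbers.
object UiExample {
  val batchInterval  = 4.0  // seconds
  val slowProcessing = 11.0 // the bottom job's processing time
  val nextProcessing = 0.9  // the next batch's processing time

  // An 11-second batch on a 4-second interval leaves the next batch queued:
  val schedulingDelay = slowProcessing - batchInterval // 11 - 4 = 7 seconds
  // Total delay = scheduling delay + processing time, ~8 seconds after rounding:
  val totalDelay = schedulingDelay + nextProcessing

  def main(args: Array[String]): Unit =
    println(s"scheduling delay = $schedulingDelay s, total delay = $totalDelay s")
}
```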