Spark Streaming: stateless overlapping windows vs. keeping state

I think one of other drawbacks of third approach is that the RDDs are not received chronologically..considering running them on a cluster..

ongoingEventsStream.foreachRDD { /*accumulate state in casssandra*/ }

also what about check-pointing and driver node failure..In that case do u read the whole data again? curious to know how you wanna handle this?

I guess maybe mapwithstate is a better approach why you consider all these scenario..


Normally there is no right approach, each has tradeoffs. Therefore I'd add additional approach to the mix and will outline my take on their pros and cons. So you can decide which one is more suitable for you.

External state approach (approach #3)

You can accumulate state of the events in external storage. Cassandra is quite often used for that. You can handle final and ongoing events separately for example like below:

val stream = ...

val ongoingEventsStream = stream.filter(!isFinalEvent)
val finalEventsStream = stream.filter(isFinalEvent)

ongoingEventsStream.foreachRDD { /*accumulate state in casssandra*/ }
finalEventsStream.foreachRDD { /*finalize state in casssandra, move to final destination if needed*/ }

trackStateByKey approach (approach #1.1)

It might be potentially optimal solution for you as it removes drawbacks of updateStateByKey, but considering it is just got released as part of Spark 1.6 release, it could be risky as well (since for some reason it is not very advertised). You can use the link as starting point if you want to find out more

Pros/Cons

Approach #1 (updateStateByKey)

Pros

  • Easy to understand or explain (to rest of the team, newcomers, etc.) (subjective)
  • Storage: Better usage of memory stores only latest state of exercise
  • Storage: Will keep only ongoing exercises, and discard them as soon as they finish
  • Latency is limited only by performance of each micro-batch processing

Cons

  • Storage: If number of keys (concurrent exercises) is large it may not fit into memory of your cluster
  • Processing: It will run updateState function for each key within the state map, therefore if number of concurrent exercises is large - performance will suffer

Approach #2 (window)

While it is possible to achieve what you need with windows, it looks significantly less natural in your scenario.

Pros

  • Processing in some cases (depending on the data) might be more effective than updateStateByKey, due to updateStateByKey tendency to run update on every key even if there are no actual updates

Cons

  • "maximal possible exercise time" - this sounds like a huge risk - it could be pretty arbitrary duration based on a human behaviour. Some people might forget to "finish exercise". Also depends on kinds of exercise, but could range from seconds to hours, when you want lower latency for quick exercises while would have to keep latency as high as longest exercise potentially could exist
  • Feels like harder to explain to others on how it will work (subjective)
  • Storage: Will have to keep all data within the window frame, not only the latest one. Also will free the memory only when window will slide away from this time slot, not when exercise is actually finished. While it might be not a huge difference if you will keep only last two time slots - it will increase if you try to achieve more flexibility by sliding window more often.

Approach #3 (external state)

Pros

  • Easy to explain, etc. (subjective)
  • Pure streaming processing approach, meaning that spark is responsible to act on each individual event, but not trying to store state, etc. (subjective)
  • Storage: Not limited by memory of the cluster to store state - can handle huge number of concurrent exercises
  • Processing: State is updated only when there are actual updates to it (unlike updateStateByKey)
  • Latency is similar to updateStateByKey and only limited by the time required to process each micro-batch

Cons

  • Extra component in your architecture (unless you already use Cassandra for your final output)
  • Processing: by default is slower than processing just in spark as not in-memory + you need to transfer the data via network
  • you'll have to implement exactly once semantic to output data into cassandra (for the case of worker failure during foreachRDD)

Suggested approach

I'd try the following:

  • test updateStateByKey approach on your data and your cluster
  • see if memory consumption and processing is acceptable even with large number of concurrent exercises (expected on peak hours)
  • fall back to approach with Cassandra in case if not