Apache flink on Kubernetes - Resume job if jobmanager crashes
Out of the box, Flink requires a ZooKeeper cluster to recover from JobManager crashes. However, I think you can have a lightweight implementation of the HighAvailabilityServices
, CompletedCheckpointStore
, CheckpointIDCounter
and SubmittedJobGraphStore
which can bring you quite far.
Given that you have only one JobManager running at all times (not entirely sure whether K8s can guarantee this) and that you have a persistent storage location, you could implement a CompletedCheckpointStore
which retrieves the completed checkpoints from the persistent storage system (e.g. reading all stored checkpoint files). Additionally, you would have a file which contains the current checkpoint id counter for CheckpointIDCounter
and all the submitted job graphs for the SubmittedJobGraphStore
. So the basic idea is to store everything on a persistent volume which is accessible by the single JobManager.