Kubernetes CronJob Stops Scheduling Jobs
How Kubernetes Jobs handle failures
As per Jobs - Run to Completion - Handling Pod and Container Failures:
An entire Pod can also fail, for a number of reasons, such as when the pod is kicked off the node (node is upgraded, rebooted, deleted, etc.), or if a container of the Pod fails and the .spec.template.spec.restartPolicy = "Never". When a Pod fails, then the Job controller starts a new Pod.
You are using restartPolicy: Never for your jobTemplate, so see the next quote, on Pod backoff failure policy:
There are situations where you want to fail a Job after some amount of retries due to a logical error in configuration etc. To do so, set .spec.backoffLimit to specify the number of retries before considering a Job as failed. The back-off limit is set by default to 6. The back-off count is reset if no new failed Pods appear before the Job’s next status check.
The .spec.backoffLimit is not defined in your jobTemplate, so it's using the default (6).
Next, as per Job Termination and Cleanup:
By default, a Job will run uninterrupted unless a Pod fails, at which point the Job defers to the .spec.backoffLimit described above. Another way to terminate a Job is by setting an active deadline. Do this by setting the .spec.activeDeadlineSeconds field of the Job to a number of seconds.
That's your case: If your containers fail to pull the image six consecutive times, your Job will be considered as failed.
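As a rough illustration of where these fields sit, the fragment below sketches the jobTemplate portion of a CronJob spec with the default backoffLimit written out explicitly. The activeDeadlineSeconds value is a hypothetical choice for illustration only; nothing in the question specifies it.

```yaml
# Fragment of a CronJob spec (not a complete manifest).
jobTemplate:
  spec:
    backoffLimit: 6              # the default, written out explicitly
    activeDeadlineSeconds: 600   # hypothetical value: terminate the Job after 10 minutes
    template:
      spec:
        restartPolicy: Never     # a failed container fails the whole Pod
```

With restartPolicy: Never, each failed Pod counts toward backoffLimit, and the Job controller creates a replacement Pod until the limit is reached.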
Cronjobs
As per Cron Job Limitations:
A cron job creates a job object about once per execution time of its schedule [...]. The Cronjob is only responsible for creating Jobs that match its schedule, and the Job in turn is responsible for the management of the Pods it represents.
This means that all pod/container failures should be handled by the Job Controller (i.e., by adjusting the jobTemplate).
"Retrying" a Job:
You do not need to recreate a Cronjob if one of its Jobs fails. You only need to wait for the next scheduled run.
If you want to run a new Job before the next schedule, you can use the Cronjob template to create a Job manually with:
kubectl create job --from=cronjob/my-cronjob-name my-manually-job-name
What you should do:
If your containers are consistently unable to pull their images, you have the following options:
- Explicitly set backoffLimit and tune it to a higher value.
- Use restartPolicy: OnFailure for your containers, so the Pod will stay on the node, and only the container will be re-run.
- Consider using imagePullPolicy: IfNotPresent. If you are not retagging your images, there is no need to force a re-pull for every job start.
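A minimal sketch of a CronJob manifest that combines the three options might look like the following. The schedule, container name, image, and the backoffLimit value of 10 are all hypothetical placeholders, and the apiVersion assumes a cluster recent enough to serve batch/v1 CronJobs (older clusters use batch/v1beta1).

```yaml
apiVersion: batch/v1              # assumption: older clusters need batch/v1beta1
kind: CronJob
metadata:
  name: my-cronjob-name
spec:
  schedule: "*/5 * * * *"         # hypothetical schedule
  jobTemplate:
    spec:
      backoffLimit: 10            # hypothetical value, higher than the default of 6
      template:
        spec:
          restartPolicy: OnFailure    # restart the container in place, keep the Pod
          containers:
            - name: main                                # hypothetical name
              image: registry.example.com/my-image:1.0  # hypothetical image
              imagePullPolicy: IfNotPresent             # skip the pull if the image is cached
```

You probably won't need all three at once; raising backoffLimit matters mostly with restartPolicy: Never, since with OnFailure the container is restarted inside the same Pod.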
Just to expand on Eduardo Baitello's answer I would also like to mention 2 more caveats:
Eduardo mentioned Cronjob Limitations, but didn't expand on the Too many missed start time (> 100) issue. For this, I've found that the only reliable solution is to delete the cronjob and recreate it. Alternatively, you can patch the cronjob to decrease its frequency, which tricks the scheduler into running it again, and then patch it back to how it was, but this is trickier. kubectl describe cronjob CRONJOB_NAME should list this as one of its events if the cronjob has been affected, and it usually affects cronjobs which have a high frequency.
If you have a lot of Cronjobs/Jobs,
then you could be experiencing this bug (#77465), which has been fixed in 1.14.7. This occurs if you have more than 500
Jobs within the entire cluster. This one is harder to find, but you can query the kube-controller-manager logs (the CronJob controller runs there, not in kube-scheduler) for expected type *batchv1.JobList, got type *internalversion.List.
You can print the logs for kube-controller-manager using the following command:
kubectl -n kube-system logs -l component=kube-controller-manager --tail 100