Scheduled batch jobs and durability
Rather than scheduling a one-time job, schedule a recurring job.
Schedule the job to run every hour. In the finish phase of your batch job, cancel this hourly schedule and replace it with another similar hourly schedule whose first execution is set a short period (let's say 5 minutes) after the job finishes.
This works in a very similar way to using a "one-off" schedule (as per your existing implementation) - in both implementations the job is rescheduled in the finish phase - but by using a recurring schedule you get the added benefit that if for any reason the job does not execute, the platform will attempt to run it again an hour later, and every hour after that until it succeeds.
Note that we don't know why the job may fail to execute - but we're assuming that it relates to platform maintenance. Chaining one-off scheduled jobs together relies on the successful start and completion of each job for the integrity of the chain, whereas using a recurring scheduled job provides "auto-resume" behaviour regardless of the successful start / completion of an individual job.
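To make the finish-phase reschedule concrete, here's a minimal sketch of what it might look like. BatchManagerJob, the hierarchy custom setting BatchManagerState__c (holding the current hourly schedule's CronTrigger Id in ScheduleId__c) and BatchManagerScheduler are all illustrative names, not taken from your implementation:

global class BatchManagerJob implements Database.Batchable<SObject>
{
    global Database.QueryLocator start( Database.BatchableContext BC )
    {
        // whatever the real job actually processes
        return Database.getQueryLocator( 'SELECT Id FROM Account' );
    }

    global void execute( Database.BatchableContext BC, List<SObject> scope )
    {
        // real work goes here
    }

    global void finish( Database.BatchableContext BC )
    {
        // cancel the existing hourly schedule; its CronTrigger Id is assumed to be
        // held in the (hypothetical) custom setting
        BatchManagerState__c state = BatchManagerState__c.getOrgDefaults();
        if( state.ScheduleId__c != null )
        {
            System.abortJob( (Id)state.ScheduleId__c );
        }

        // replace it with a new hourly schedule whose first run is ~5 minutes from now,
        // then at the same minute past every subsequent hour
        Datetime first = System.now().addMinutes( 5 );
        String cron = '0 ' + first.minute() + ' * * * ?';
        state.ScheduleId__c = System.schedule( 'Batch Manager ' + first, cron, new BatchManagerScheduler() );
        upsert state;
    }
}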
Example process flow:
(1) at 12:00 we schedule a job to run every hour, at 5 minutes past the hour: 12:05, 13:05, 14:05...etc...
(2) at 12:05 the batch manager job is started according to the hourly schedule, and this checks your custom batch job object records to see if there is any work currently running or waiting.
It finds that there are no jobs running but there is a job waiting: "Foo". The batch manager therefore starts the batch process for Foo.
(3) at 13:05 the batch manager job is started according to the hourly schedule.
On this occasion it finds that job Foo is in progress, and so it quits without taking any action.
(4) at 13:35 job Foo finishes.
In the finish phase, the existing hourly scheduled job is cancelled, and another new hourly job is scheduled, this time to run at 40 minutes past the hour: 13:40, 14:40, 15:40...etc…
(5) at 13:40 the batch manager job is due to start according to the hourly schedule, but this fails (we assume because of platform maintenance).
(6) at 14:40 the batch manager job is started according to the hourly schedule.
It finds that there are no jobs running but there is a job waiting: "Bar". The batch manager therefore starts the batch process for Bar.
etc.
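For illustration, a rough sketch of the hourly batch manager itself, assuming a hypothetical BatchJobRequest__c custom object with a Status__c field ('Waiting'/'Running') tracking the work, and FooBatch standing in for the real batch class:

global class BatchManagerScheduler implements Schedulable
{
    global void execute( SchedulableContext SC )
    {
        // if anything is already running, quit and let the next hourly run check again
        Integer running = [ SELECT count() FROM BatchJobRequest__c WHERE Status__c = 'Running' ];
        if( running > 0 )
        {
            return;
        }

        // otherwise start the oldest waiting job, if there is one
        List<BatchJobRequest__c> waiting = [ SELECT Id, Name FROM BatchJobRequest__c WHERE Status__c = 'Waiting' ORDER BY CreatedDate LIMIT 1 ];
        if( !waiting.isEmpty() )
        {
            waiting[ 0 ].Status__c = 'Running';
            update waiting;
            Database.executeBatch( new FooBatch( waiting[ 0 ].Id ) );
        }
    }
}

The very first hourly schedule (step 1, at 12:00) would then be created once, e.g. from anonymous Apex:

System.schedule( 'Batch Manager', '0 5 * * * ?', new BatchManagerScheduler() );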
I have seen this type of behaviour on a number of occasions. It seems to be related (in my experience) to what you've said above:
if the main batch manager finds no jobs to do, it schedules itself to run again a short time in the future.
And on occasion that 'short time' passes before the job has had a chance to execute successfully, either as a result of high load or, as you say above, a maintenance window.
Have you considered a second 'keep-alive' scheduled job that runs, say, once an hour (so it is less likely to be affected by the aforementioned issues), checks that things are in order, and reschedules your initial job if they are not?
Edit: As it happens, I am doing some testing in a Sandbox right now where we have increased the scheduled job to run every minute (just while users are doing some hefty testing), and I'm getting this very issue: the scheduled job ends up with no Next Start time and hangs in limbo. So, the KeepAlive job is going to look something like:
global class KeepAlive implements Schedulable
{
    global void execute( SchedulableContext SC )
    {
        // the worker job stores its CronTrigger id in a custom setting
        JobIdState__c jobIdState = JobIdState__c.getInstance();
        Id workerId = (Id)jobIdState.JobId__c;

        // look for a worker schedule that no longer has a next run, i.e. it is dead
        // (query into a list so an empty result doesn't throw a QueryException)
        List<CronTrigger> dead = [ SELECT Id FROM CronTrigger WHERE Id = :workerId AND NextFireTime = null ];

        // abort the dead job and start a new one
        if( !dead.isEmpty() )
        {
            System.abortJob( workerId );

            // start a new worker again in a minute
            Datetime sysTime = System.now().addSeconds( 60 );
            String cronExpression = '' + sysTime.second() + ' ' + sysTime.minute() + ' ' + sysTime.hour() + ' ' + sysTime.day() + ' ' + sysTime.month() + ' ? ' + sysTime.year();
            System.schedule( 'Worker Scheduler ' + sysTime, cronExpression, new WorkerScheduler() );
        }

        // abort me and start again
        System.abortJob( SC.getTriggerId() );
        KeepAlive.start();
    }

    public static void start()
    {
        // start KeepAlive again in 5 mins
        Datetime sysTime = System.now().addSeconds( 300 );
        String cronExpression = '' + sysTime.second() + ' ' + sysTime.minute() + ' ' + sysTime.hour() + ' ' + sysTime.day() + ' ' + sysTime.month() + ' ? ' + sysTime.year();
        System.schedule( 'KeepAlive ' + sysTime, cronExpression, new KeepAlive() );
    }
}
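For completeness, the worker side is assumed to store its own CronTrigger Id in that custom setting whenever it (re)schedules itself - something along these lines (the scheduleWorker method name is illustrative; JobIdState__c and WorkerScheduler are as above):

// e.g. in WorkerScheduler
public static void scheduleWorker()
{
    Datetime sysTime = System.now().addSeconds( 60 );
    String cronExpression = '' + sysTime.second() + ' ' + sysTime.minute() + ' ' + sysTime.hour() + ' ' + sysTime.day() + ' ' + sysTime.month() + ' ? ' + sysTime.year();

    // System.schedule returns the new CronTrigger Id as a String
    String workerId = System.schedule( 'Worker Scheduler ' + sysTime, cronExpression, new WorkerScheduler() );

    // store it where KeepAlive expects to find it
    JobIdState__c jobIdState = JobIdState__c.getOrgDefaults();
    jobIdState.JobId__c = workerId;
    upsert jobIdState;
}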
I've just implemented this now, and will obviously have to wait until the next failure to determine whether it has worked or not, but I thought it might be useful.
Edit 2: Jobs scheduled in code do count towards limits; however, spent ones without a NextFireTime don't seem to.
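If you want to eyeball that in an org, a quick anonymous-Apex check along these lines (the limit behaviour is my observation, not something I can point to in the docs):

Integer live  = [ SELECT count() FROM CronTrigger WHERE NextFireTime != null ];
Integer spent = [ SELECT count() FROM CronTrigger WHERE NextFireTime = null ];
System.debug( 'Scheduled jobs with a next run: ' + live + ', spent: ' + spent );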
How about a "dead man's switch" using a "sentinel" custom object, time-based workflow rule and a trigger?:
- You have a sentinel object, where a single record represents your batch job manager
- Your batch job manager updates the sentinel record with a datetime field to say "I'm alive now"
- Your time-based workflow fires a period of time after "I'm alive now", and sets a "reboot" field on the record
- A trigger on your sentinel object reschedules the batch job manager when it sees a "reboot" (rough sketch below)
NB: Completely untried :-)
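For what it's worth, the trigger piece might look roughly like this - Sentinel__c, Reboot__c and BatchManagerScheduler are placeholder names:

trigger SentinelReboot on Sentinel__c ( after update )
{
    for( Sentinel__c s : Trigger.new )
    {
        Sentinel__c old = Trigger.oldMap.get( s.Id );

        // the time-based workflow has just flipped the "reboot" flag
        if( s.Reboot__c && !old.Reboot__c )
        {
            // restart the batch job manager a minute from now
            Datetime next = System.now().addSeconds( 60 );
            String cron = '' + next.second() + ' ' + next.minute() + ' ' + next.hour() + ' ' + next.day() + ' ' + next.month() + ' ? ' + next.year();
            System.schedule( 'Batch Manager ' + next, cron, new BatchManagerScheduler() );
        }
    }
}

The batch job manager would also need to clear the "reboot" flag and refresh the "I'm alive now" datetime each time it runs, so the workflow timer keeps getting pushed back while things are healthy.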