Outgrowing cron: what's the next scheduler?

Solution 1:

Condor, OGE, and Torque can all get you there, but only Condor has built-in dependency management, through its DAGMan tool. DAGMan lets you set up a directed acyclic graph that describes your workflow; the manager then takes care of moving through the jobs in that workflow and evaluating pass/fail results at each step. Condor is relatively platform agnostic, which means DAGMan is too, and you can certainly have a child step run on AIX when the parent ran on Linux or Windows. DAGMan isn't concerned with where jobs run, just with whether their exit codes indicate pass or fail.
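To make the idea concrete, here is a minimal Python sketch of dependency-driven execution with exit-code pass/fail checks. It is not DAGMan and does not use any Condor API; `run_dag`, the job names, and the "skipped" status are hypothetical names chosen for illustration, and the sketch runs everything locally and serially.

```python
import subprocess
import sys
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(jobs, parents):
    """Run jobs in dependency order and judge each one by its exit code.

    jobs:    dict mapping job name -> argv list to execute
    parents: dict mapping job name -> set of job names that must pass first
    Returns a dict mapping job name -> "ok", "failed", or "skipped".
    """
    graph = {name: parents.get(name, set()) for name in jobs}
    status = {}
    # static_order() yields every job after all of its parents.
    for name in TopologicalSorter(graph).static_order():
        if any(status[p] != "ok" for p in graph[name]):
            status[name] = "skipped"  # a parent failed or was skipped
            continue
        result = subprocess.run(jobs[name])
        status[name] = "ok" if result.returncode == 0 else "failed"
    return status


if __name__ == "__main__":
    # Hypothetical three-step workflow; the middle step exits non-zero.
    jobs = {
        "extract": [sys.executable, "-c", "pass"],
        "transform": [sys.executable, "-c", "import sys; sys.exit(1)"],
        "load": [sys.executable, "-c", "pass"],
    }
    parents = {"transform": {"extract"}, "load": {"transform"}}
    print(run_dag(jobs, parents))
```

A real workflow manager adds retries, parallel execution of independent branches, and remote dispatch on top of this; the sketch only shows the core rule that a child runs when, and only when, all of its parents exited successfully.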

Any tips for choosing the software or whether it is better to go open source or commercial?

With some caveats I think the free communities in this space are well worth looking at.

OGE is in a weird space now. It's no longer free to run the Oracle-produced GE variant, and Oracle is no longer contributing the code it writes back to the GE SCC, but several forks of the code exist that are trying to soldier on as free, open source projects. Univa in particular has led the charge, hiring ex-Sun GE devs to continue work on an open source, freely available GE variant. Grid Engine has two things going for it: it's easy to set up, and it can handle short-running (<2 minute) jobs without imposing so much scheduling overhead that throughput suffers. Its big downside is that Windows support is not very good. Some of us put some effort into porting it to run on Cygwin many years ago, but it's not as good as native, that's for sure.

Now, Condor is my favourite of the three technologies you mentioned. There's a strong community around Condor and the software is very mature (>20 years old now). Native Windows and POSIX OS support means it runs all over the place very well. It can be a touch complicated to set up, but once it's up and running it's rock solid. It has an incredibly flexible language for doing job <-> machine matching and for building the usage rules for your resources. It also supports dynamic provisioning on machines, letting jobs select how much of a machine's resources they need and then re-advertising the remainder as still available. It supports global resource counters, so you can constrain against things like software licenses. And of course it has the aforementioned DAGMan, which is an incredibly powerful tool for workflow management and just one of the many great pieces that come with Condor. The downside to Condor is that the scheduling overhead for short-running jobs can be burdensome. Ideally you want jobs that run longer than 2 minutes; otherwise scheduling starts to become a big part of the job's time in the system.
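The dynamic provisioning idea is easier to see with numbers. Below is a small Python sketch of "carve out what the job asks for, then re-advertise the rest". It is only an illustration of the concept, not Condor's implementation or configuration syntax; `Resources` and `claim` are hypothetical names, and only CPUs and memory are modelled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Resources:
    cpus: int
    memory_mb: int


def claim(advertised: Resources, requested: Resources) -> Optional[Resources]:
    """Carve `requested` out of `advertised`.

    Returns the remainder that the machine would re-advertise as still
    available, or None if the request does not fit.
    """
    if requested.cpus > advertised.cpus or requested.memory_mb > advertised.memory_mb:
        return None
    return Resources(
        cpus=advertised.cpus - requested.cpus,
        memory_mb=advertised.memory_mb - requested.memory_mb,
    )


if __name__ == "__main__":
    machine = Resources(cpus=8, memory_mb=16384)
    # A job asking for 2 CPUs and 4 GB leaves 6 CPUs and 12 GB to offer.
    print(claim(machine, Resources(cpus=2, memory_mb=4096)))
```

The point of the design is that one large machine can serve a mix of small and large jobs without being pre-split into fixed-size slots.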

Torque is a little more niche, and I know less about it, I'm afraid. It compares more to Grid Engine than to Condor. There are paid add-ons, which @warren mentioned, that can expand what the basic, free Torque can do.

If you want to try out the three technologies and see how they work with your specific workloads, CycleCloud can spin up secure, virtualized pools that are pre-configured with Condor, GridEngine, or Torque -- so you spend no time figuring that stuff out yourself. It'd cost a few dollars to spin up small pools of each technology and try them with representative workloads. (Disclaimer: I work for Cycle Computing; we make CycleCloud.)

Solution 2:

Chronos looks very promising.

Chronos is Airbnb's replacement for cron. It is a distributed and fault-tolerant scheduler that runs on top of Apache Mesos. You can use it to orchestrate jobs. It supports custom Mesos executors as well as the default command executor. Thus, by default, Chronos executes sh (on most systems, bash) scripts. Chronos can be used to interact with systems such as Hadoop (incl. EMR), even if the Mesos slaves on which execution happens do not have Hadoop installed. Included wrapper scripts allow transferring files and executing them on a remote machine in the background, using asynchronous callbacks to notify Chronos of job completion or failure.

I've also had great personal success using Jenkins as a cron replacement. It handles executing jobs on remote servers quite nicely. Here's a writeup on it: http://www.22ideastreet.com/blog/2014/05/02/replace-local-cron-with-jenkins/


Solution 3:

For the past 4.5 years, I have worked with HP's (nee Opsware) Server Automation platform, and the rest of the Business Technology Optimization suite (Network Automation, Operations Orchestration, etc).

For a large enough environment, job management via SA is a highly viable (and desirable) option. In conjunction with OO, jobs can be controlled via change control management, ticketing, etc.

Here's the not-so-fun part: it's pricey (very pricey). You might check some of the suggestions in a similar question I asked a while back: FLOSS Server management and audit tools.

I'd also say that Torque/Maui/Moab (from Adaptive Computing) are very cool: not sure on pricing, but they are highly flexible tools as well.


Disclaimer - I work for a partner of HP BTO and Adaptive


Solution 4:

NOTE: A completely different take on the problem!

cron is old and clunky in certain respects.

If you are indeed looking for new ways to do scheduling I'd try something event based with a messaging middleware. Think RabbitMQ with clients on each server.

Inter-host dependencies can be solved with "notification queues".

"Real" time-based events are a little trickier; that's actually what cron is for (and it is quite good at it, at least in small environments). Where it gets tricky is preventing hiccups. For example: every night at 0100h, do a snapshot. You might see load spikes or a lot of failing logins at that very moment throughout your whole infrastructure. With a queue-based approach you'll get at least some deviation in start times for free (although it's not guaranteed -- unless some logic implements that).

The thing to get around is that without real time-based jobs you can't rely on things like: yeah, my backups will start at 0200h, and if they're still running at 0400h something's wrong. What's easier to do is making sure that no two jobs that interfere with each other run at the same time. Just make a blocking agent that will only consume one job at a time.
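A rough Python sketch of such a blocking agent follows. To keep it self-contained it uses the standard library's in-process `queue.Queue` as a stand-in for a real broker queue (RabbitMQ or similar), so it says nothing about any broker's client API; `start_agent` and the `None` stop sentinel are hypothetical names for this illustration.

```python
import queue
import threading


def start_agent(jobs):
    """Start a single worker thread that runs queued callables one at a time.

    jobs: a queue.Queue of zero-argument callables; putting None stops the agent.
    Because there is exactly one consumer, two jobs never overlap.
    """
    def worker():
        while True:
            job = jobs.get()  # blocks until a job is available
            if job is None:   # stop sentinel (an assumption of this sketch)
                jobs.task_done()
                return
            try:
                job()
            except Exception as exc:
                # A failing job must not take the agent down with it.
                print(f"job failed: {exc}")
            finally:
                jobs.task_done()

    agent = threading.Thread(target=worker, daemon=True)
    agent.start()
    return agent


if __name__ == "__main__":
    jobs = queue.Queue()
    agent = start_agent(jobs)
    jobs.put(lambda: print("backup"))
    jobs.put(lambda: print("snapshot"))
    jobs.put(None)
    agent.join()
```

Jobs run strictly in the order they were queued. With a real message broker the same effect comes from having one consumer per host that acknowledges a message only after the job has finished.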

The management part would be some nice web interface where jobs could be submitted on demand. For the time-based part, it comes back to "cron" or your favorite implementation of it (the Java Quartz scheduler has a granularity of seconds, AFAIK): just use good old cron :)

Please don't downvote me for being OT -- it's a rather rough concept, but since the question doesn't rule out spending money, one might as well spend it on building a solution for the exact in-house requirements rather than on buying something that a vendor thinks fulfills some requirements :)