How to convert Linux cron jobs to "the Amazon way"?

Amazon has just released new features for Elastic Beanstalk. From the docs:

AWS Elastic Beanstalk supports periodic tasks for worker environment tiers in environments running a predefined configuration with a solution stack that contains "v1.2.0" in the container name.

You can now create an environment containing a cron.yaml file that configures scheduled tasks:

version: 1
cron:
- name: "backup-job"          # required - unique across all entries in this file
  url: "/backup"              # required - does not need to be unique
  schedule: "0 */12 * * *"    # required - does not need to be unique
- name: "audit"
  url: "/audit"
   schedule: "0 23 * * *"

I would imagine that the assurance of running it only once in an autoscaled environment is provided via the message queue (SQS). When the cron daemon triggers an event, it puts that call into the SQS queue, and a message in the queue is only evaluated once. The docs say that execution might be delayed if SQS has many messages to process.
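For context on what the worker side has to provide: each scheduled run results in an HTTP POST from the daemon on the worker instance to the URL named in cron.yaml. Below is a minimal sketch, assuming a Python application using Flask (my choice of framework; Elastic Beanstalk does not prescribe one), with the job bodies stubbed out:

from flask import Flask

# Elastic Beanstalk's Python platform looks for a callable named "application"
application = Flask(__name__)

@application.route("/backup", methods=["POST"])
def backup():
    do_backup()        # your actual backup job
    return "", 200     # a 200 response tells the worker daemon the message was processed

@application.route("/audit", methods=["POST"])
def audit():
    do_audit()
    return "", 200

def do_backup():
    print("running backup")

def do_audit():
    print("running audit")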


I signed up for Amazon Gold support to ask them this question; this was their response:

Tom

I did a quick poll of some of my colleagues and came up empty on the cron question, but after sleeping on it I realised the important step may be limited to locking. So I looked for "distributed cron job locking" and found a reference to ZooKeeper, an Apache project.

http://zookeeper.apache.org/doc/r3.2.2/recipes.html

http://highscalability.com/blog/2010/3/22/7-secrets-to-successfully-scaling-with-scalr-on-amazon-by-se.html
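As a sketch of the lock recipe linked above (the kazoo client library, host name and lock path are my own assumptions, not anything from the support response): every instance fires the same cron entry, but only the one that acquires the ZooKeeper lock actually runs the job.

from kazoo.client import KazooClient

def run_backup_once():
    zk = KazooClient(hosts="zk.example.internal:2181")   # hypothetical ZooKeeper ensemble
    zk.start()
    lock = zk.Lock("/cron-locks/backup", "this-instance")
    # Only the instance that acquires the lock runs the job; with
    # blocking=False the others simply skip this scheduled run.
    if lock.acquire(blocking=False):
        try:
            do_backup()
        finally:
            lock.release()
    zk.stop()

def do_backup():
    print("running backup")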

Also I have seen references to using memcached, or a similar caching mechanism, as a way to create locks with a TTL. In this way you set a flag with a TTL of 300 seconds, and no other cron worker will execute the job. The lock will automatically be released after the TTL has expired. This is conceptually very similar to the SQS option we discussed yesterday.
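A sketch of that TTL-lock idea (the pymemcache library, host name and key naming are assumptions of mine, not part of the response): memcached's add() only stores a key that does not already exist, so it doubles as a test-and-set lock.

from pymemcache.client.base import Client

def run_job_once(job_name, job_fn, ttl=300):
    client = Client(("memcached.example.internal", 11211))
    # add() succeeds only if the key is absent, so whichever worker wins the
    # add() holds the lock; memcached expires it on its own after ttl seconds.
    got_lock = client.add("cron-lock:" + job_name, b"1", expire=ttl, noreply=False)
    if got_lock:
        job_fn()
    # otherwise another instance has already claimed this run of the job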

Also see Google's Chubby: http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/chubby-osdi06.pdf

Let me know if this helps, and feel free to ask questions; we are very aware that our services can be complex and daunting to beginners and seasoned developers alike. We are always happy to offer architecture and best practice advice.

Best regards,

Ronan G., Amazon Web Services


I think this video answers your exact question - cron jobs the AWS way (scalable and fault tolerant):

Using Cron in the Cloud with Amazon Simple Workflow

The video describes the SWF service using the specific use case of implementing cron jobs.

The relative complexity of the solution can be hard to swallow if you are coming straight from a crontab. There is a case study at the end that helped me understand what that extra complexity buys you. I would suggest watching the case study and considering your requirements for scalability and fault tolerance to decide whether you should migrate from your existing crontab solution.


Be careful with using SQS for cron jobs, as it doesn't guarantee that "one job is seen by only one machine". It guarantees "at least once" delivery, so a message may be received more than once.

From: http://aws.amazon.com/sqs/faqs/#How_many_times_will_I_receive_each_message

Q: How many times will I receive each message?

Amazon SQS is engineered to provide “at least once” delivery of all messages in its queues. Although most of the time each message will be delivered to your application exactly once, you should design your system so that processing a message more than once does not create any errors or inconsistencies.
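In practice that means the job handler has to be idempotent. A rough sketch of a consumer under that constraint, assuming boto3 and a queue URL of my own invention (the in-memory "seen" set is purely illustrative - real deduplication would need shared storage such as a database):

import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/cron-jobs"   # hypothetical queue

sqs = boto3.client("sqs")
seen_job_ids = set()   # illustration only; must be shared storage in a real deployment

def poll_once():
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        job_id = msg["Body"]
        # "At least once" delivery means the same message can arrive twice,
        # so processing has to be a no-op when a duplicate shows up.
        if job_id not in seen_job_ids:
            seen_job_ids.add(job_id)
            run_job(job_id)
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

def run_job(job_id):
    print("running job", job_id)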

So far the solution I can think of is to have one instance with a Gearman job server installed: http://gearman.org/. On the same machine you configure cron jobs that submit the command to execute your cron task as a background job. Then one of your web servers (workers) will start executing this task, and Gearman guarantees that only one will take it. It doesn't matter how many workers you have (especially when you are using auto scaling). A rough sketch of this setup follows below.
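Here is how that might look, as a sketch only, assuming the python-gearman client library and a job server host name of my own invention:

import gearman   # python-gearman client library (my choice of client)

GEARMAN_SERVERS = ["gearman.example.internal:4730"]   # hypothetical job server

# On the single machine running the cron daemon: a crontab entry such as
#   0 */12 * * * /usr/bin/python /opt/cron/submit_backup.py
# whose only job is to hand the task to Gearman as a background job.
def submit_backup():
    client = gearman.GearmanClient(GEARMAN_SERVERS)
    client.submit_job("backup", "", background=True)

# On every auto-scaled web server: a worker process. Gearman hands each
# submitted job to exactly one of the registered workers.
def handle_backup(gearman_worker, gearman_job):
    print("running backup")
    return ""   # workers return a (possibly empty) result string

def run_worker():
    worker = gearman.GearmanWorker(GEARMAN_SERVERS)
    worker.register_task("backup", handle_backup)
    worker.work()   # blocks, processing jobs as they arrive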

The problems with this solution are:

  • The Gearman server is a single point of failure, unless you configure it with distributed storage, for example using memcached or some database.
  • If you use multiple Gearman servers, you have to select one that creates the task via its cron job, so again we are back to the same problem. But if you can live with this kind of single point of failure, using Gearman looks like quite a good solution, especially as you don't need a big instance for it (a micro instance was enough in our case).