How to make Upstart back off rather than give up
Solution 1:
The Upstart Cookbook recommends a post-stop delay (http://upstart.ubuntu.com/cookbook/#delay-respawn-of-a-job). Use the respawn
stanza without arguments and it will continue trying forever:
respawn
post-stop exec sleep 5
(I got this from this Ask Ubuntu question)
To add exponential backoff, I would try keeping the delay in an environment variable and updating it in the post-stop script, something like:
env SLEEP_TIME=1
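post-stop script
    # Wait before Upstart respawns the job.
    sleep $SLEEP_TIME
    # Double the delay for the next crash, capped at 60 seconds.
    NEW_SLEEP_TIME=`expr 2 \* $SLEEP_TIME`
    if [ $NEW_SLEEP_TIME -ge 60 ]; then
        NEW_SLEEP_TIME=60
    fi
    initctl set-env SLEEP_TIME=$NEW_SLEEP_TIME
end script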
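To see which delays the doubling-and-capping logic above would produce, here is a minimal sketch in plain sh that only computes the sequence and never sleeps. The function names next_delay and delay_sequence are my own, not part of Upstart, and the starting delay of 1 and cap of 60 are taken from the snippet above.

```shell
#!/bin/sh
# next_delay CURRENT MAX: double CURRENT, but never go above MAX.
next_delay() {
    doubled=$(( $1 * 2 ))
    if [ "$doubled" -ge "$2" ]; then
        echo "$2"
    else
        echo "$doubled"
    fi
}

# delay_sequence N: print the delays used for the first N crashes,
# starting at 1 second with a 60 second cap (no actual sleeping).
delay_sequence() {
    delay=1
    count=0
    seq_out=""
    while [ "$count" -lt "$1" ]; do
        seq_out="$seq_out $delay"
        delay=$(next_delay "$delay" 60)
        count=$(( count + 1 ))
    done
    # Strip the leading space added by the first append.
    echo "${seq_out# }"
}

delay_sequence 8   # prints: 1 2 4 8 16 32 60 60
```

So the cap is reached on the seventh crash, and every crash after that waits the full 60 seconds.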
** EDIT **
To apply the delay only when the job is being respawned, and skip it on a real stop, use the following, which checks whether the job's current goal is "stop":
env SLEEP_TIME=1
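post-stop script
    # The second field of the status line is "goal/state"; keep only the goal.
    goal=`initctl status $UPSTART_JOB | awk '{print $2}' | cut -d '/' -f 1`
    # Quote $goal so the test does not break if the goal could not be read.
    if [ "$goal" != "stop" ]; then
        sleep $SLEEP_TIME
        NEW_SLEEP_TIME=`expr 2 \* $SLEEP_TIME`
        if [ $NEW_SLEEP_TIME -ge 60 ]; then
            NEW_SLEEP_TIME=60
        fi
        initctl set-env SLEEP_TIME=$NEW_SLEEP_TIME
    fi
end script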
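The goal check relies on the shape of the status line, so it may help to try the awk/cut pipeline on its own. This is a small sketch that feeds it sample lines instead of calling initctl. The job name myjob, the PID, and the helper name job_goal are made up for illustration, and the sample lines assume the usual "job goal/state, process PID" layout.

```shell
#!/bin/sh
# job_goal STATUS_LINE: extract the goal ("start" or "stop") from one
# line of status output, using the same pipeline as the post-stop script.
job_goal() {
    echo "$1" | awk '{print $2}' | cut -d '/' -f 1
}

job_goal "myjob start/running, process 1234"   # prints: start
job_goal "myjob stop/waiting"                  # prints: stop
```

If the status line for your job carries extra fields before the goal/state pair, the second field would no longer be the goal and the pipeline would need adjusting.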
Solution 2:
As already mentioned, use respawn to enable respawning.
However, the Upstart Cookbook section on respawn limit says that you need to specify respawn limit unlimited to get continual retry behaviour.
By default, Upstart keeps retrying only as long as the job does not respawn more than 10 times in 5 seconds; beyond that it gives up.
I would therefore suggest:
respawn
respawn limit unlimited
post-stop <script to back-off or constant delay>
Solution 3:
I ended up putting a start command in a cron job. If the service is already running, it has no effect; if it is not running, it starts the service.
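A rough sketch of what such a cron-driven check could look like, assuming you want to look at the job's status first and only call start when it does not report start/running. The function name ensure_running and the STATUS_CMD / START_CMD overrides are hypothetical; the overrides exist only so the logic can be tried on a machine without Upstart.

```shell
#!/bin/sh
# ensure_running JOB: start JOB unless its status already reports
# start/running. STATUS_CMD and START_CMD default to Upstart's own
# "status" and "start" commands; overriding them is a hypothetical
# hook for testing.
ensure_running() {
    if ${STATUS_CMD:-status} "$1" 2>/dev/null | grep -q 'start/running'; then
        return 0
    fi
    ${START_CMD:-start} "$1"
}

# Hypothetical crontab line calling a script that wraps this function:
# * * * * * /usr/local/bin/ensure-running.sh myservice
```

The script named in the crontab comment would just define the function and end with ensure_running "$1". The path and the service name are placeholders.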
Solution 4:
This is an improvement on Roger's answer. Typically you want to back off when a problem in the underlying software makes it crash repeatedly in a short period of time, but once the system has recovered you want to reset the backoff time. In Roger's version, once the delay has climbed to 60 seconds (after about 7 crashes), the service always sleeps for 60 seconds, even after a single, isolated crash.
# The initial delay.
env INITIAL_SLEEP_TIME=1
# The current delay.
env CURRENT_SLEEP_TIME=1
# The maximum delay.
env MAX_SLEEP_TIME=60
# The Unix timestamp of the last crash.
env LAST_CRASH=0
# The number of seconds without any crash
# to consider the service healthy and reset the backoff.
env HEALTHY_THRESHOLD=180
post-stop script
    exec >> /var/log/auth0.log 2>&1
    echo "`date`: stopped $UPSTART_JOB"
    goal=`initctl status $UPSTART_JOB | awk '{print $2}' | cut -d '/' -f 1`
    if [ "$goal" != "stop" ]; then
        CRASH_TIMESTAMP=$(date +%s)
        if [ $LAST_CRASH -ne 0 ]; then
            SECS_SINCE_LAST_CRASH=`expr $CRASH_TIMESTAMP - $LAST_CRASH`
            if [ $SECS_SINCE_LAST_CRASH -ge $HEALTHY_THRESHOLD ]; then
                echo "resetting backoff"
                CURRENT_SLEEP_TIME=$INITIAL_SLEEP_TIME
            fi
        fi
        echo "backoff for $CURRENT_SLEEP_TIME"
        sleep $CURRENT_SLEEP_TIME
        NEW_SLEEP_TIME=`expr 2 \* $CURRENT_SLEEP_TIME`
        if [ $NEW_SLEEP_TIME -ge $MAX_SLEEP_TIME ]; then
            NEW_SLEEP_TIME=$MAX_SLEEP_TIME
        fi
        initctl set-env CURRENT_SLEEP_TIME=$NEW_SLEEP_TIME
        initctl set-env LAST_CRASH=$CRASH_TIMESTAMP
    fi
end script
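The reset rule is the part that differs from the simpler backoff, so here it is isolated as a small sh function you can run outside Upstart. It is only a sketch: backoff_delay and its positional arguments are my own naming, and the timestamps in the calls are made-up values chosen to show the three cases.

```shell
#!/bin/sh
# backoff_delay CURRENT LAST_CRASH NOW INITIAL THRESHOLD
# Prints the delay to use for this crash: INITIAL if the previous crash
# is at least THRESHOLD seconds old, otherwise CURRENT. A LAST_CRASH of
# 0 means there has been no earlier crash, so nothing is reset.
backoff_delay() {
    current=$1
    last=$2
    now=$3
    initial=$4
    threshold=$5
    if [ "$last" -ne 0 ] && [ $(( now - last )) -ge "$threshold" ]; then
        echo "$initial"
    else
        echo "$current"
    fi
}

backoff_delay 60 1000 1030 1 180   # prints: 60 (crashed again after 30 s)
backoff_delay 60 1000 1200 1 180   # prints: 1 (200 s without a crash)
backoff_delay 1 0 1200 1 180       # prints: 1 (first crash)
```

With a threshold of 180 seconds, a job that has been crash-looping at the 60 second cap drops back to a 1 second delay as soon as it survives three minutes.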