How can I automatically kill idle GCE instances based on CPU usage?
The following worked for me. It's a bash script which uses the uptime
UNIX command to check whether the 15-minute average load on the CPU is below a threshold, and automatically shuts down the system if this is true on ten consecutive checks. You need to run this within your VM instance.
Credit, and more detailed explanation: Rohit Rawat's blog.
#!/bin/bash
threshold=0.4
count=0
while true
do
load=$(uptime | sed -e 's/.*load average: //g' | awk '{ print $3 }')
res=$(echo $load'<'$threshold | bc -l)
if (( $res ))
then
echo "Idling.."
((count+=1))
fi
echo "Idle minutes count = $count"
if (( count>10 ))
then
echo Shutting down
# wait a little bit more before actually pulling the plug
sleep 300
sudo poweroff
fi
sleep 60
done
It seems like there are two parts to this question:
- Identifying dead instances.
- Killing off those instances.
In terms of identifying dead instances, one way to do this would be to have a separate, management instance that does not run this software and that keeps tabs on the other instances. For example, it could do this by periodically sending a health request to the various instances and marking non-responsive instances or instances reporting an overly high CPU usage as unhealthy.
Once your management instance has identified the unhealthy instances that need to be reset, you should be able to reset those other instances using the API (I'm guessing the reset command) or by executing the same operation using the gcloud commandline tool.