How to diagnose causes of oom-killer killing processes
Solution 1:
No, the algorithm is not that simplistic. You can find more information in:
http://linux-mm.org/OOM_Killer
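Roughly speaking, the kernel computes a "badness" score for every process and kills the one with the highest score. You can peek at those scores directly in /proc; a quick sketch (sshd is only an example target, and on older kernels the adjustment file is oom_adj with a -17..15 range rather than oom_score_adj):
# Badness score the kernel currently assigns (higher = more likely victim);
# sshd is just an example, substitute any process you care about.
cat /proc/$(pidof -s sshd)/oom_score
# The user-controlled adjustment added to the score before a victim is chosen.
cat /proc/$(pidof -s sshd)/oom_score_adj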
If you want to track memory usage, I'd recommend running a command like:
ps -e -o pid,user,cpu,size,rss,cmd --sort -size,-rss | head
It will give you a list of the processes using the most memory (and probably causing the OOM situation). Remove the | head if you'd prefer to see all processes.
If you put this in your crontab, run it every 5 minutes and append the output to a file. Keep at least a couple of days of history so you can check later what happened.
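Something along these lines in the crontab would do it (the log path is only an example, and pruning the file after a few days is left to logrotate or similar):
# Every 5 minutes, timestamp and append the top memory consumers to a log file.
*/5 * * * * { date; ps -e -o pid,user,cpu,size,rss,cmd --sort -size,-rss | head; } >> /var/log/top-mem.log 2>&1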
For critical services like ssh, I'd recommend using monit to restart them automatically in such a situation. It might save you from losing access to the machine if you don't have a remote console to it.
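A sketch of the corresponding monit stanza, assuming a stock sshd with a pidfile under /var/run (the pidfile path and init script names differ between distributions, so treat them as placeholders):
check process sshd with pidfile /var/run/sshd.pid
  start program = "/etc/init.d/ssh start"
  stop program = "/etc/init.d/ssh stop"
  if failed port 22 protocol ssh then restart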
Best of luck,
João Miguel Neves
Solution 2:
I had a hard time with that recently, because the process(es) that the oom-killer stomps on aren't necessarily the ones that have gone awry. While trying to diagnose that, I learned about one of my now-favorite tools, atop.
This utility is like top on steroids. It records system information at a pre-set interval, and you can play the recording back later to see what was going on. It highlights processes at 80%+ usage in blue and 90%+ in red. The most useful view is the memory table, which shows how much memory each process allocated in the last interval; that's the one that helped me the most.
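For reference, the record-and-replay workflow looks roughly like this (the /var/log/atop path is how the packaged cron job names its daily logs on many distributions, so treat it as an assumption):
# Record a snapshot every 60 seconds into a raw file until interrupted.
atop -w /tmp/atop.raw 60
# Replay it later: 't' and 'T' step forward/backward in time, 'm' switches
# to the per-process memory view.
atop -r /tmp/atop.raw
# Distribution packages usually also rotate daily logs you can replay:
atop -r /var/log/atop/atop_YYYYMMDD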
Fantastic tool -- can't say enough about it.
atop performance monitor