How to find out why process was killed on server
If a process consumes too much memory, the kernel's "Out of Memory" (OOM) killer will automatically kill the offending process. It sounds like this may have happened to your job. The kernel log records OOM killer actions, so use the "dmesg" command to see what happened, e.g.
dmesg | less
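If the kernel log is long, you can filter for just the OOM-related lines first (a minimal example; the exact message wording varies between kernel versions):
dmesg | grep -iE "out of memory|oom-killer|killed process"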
You will see OOM killer messages, something like the following:
[ 54.125380] Out of memory: Kill process 8320 (stress-ng-brk) score 324 or sacrifice child
[ 54.125382] Killed process 8320 (stress-ng-brk) total-vm:1309660kB, anon-rss:1287796kB, file-rss:76kB
[ 54.522906] gmain invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
[ 54.522908] gmain cpuset=accounts-daemon.service mems_allowed=0
[ 54.522912] CPU: 6 PID: 1032 Comm: gmain Not tainted 4.4.0-0-generic #3-Ubuntu
[ 54.522913] Hardware name: Intel Corporation Skylake Client platform/Skylake DT DDR4 RVP8, BIOS SKLSE2R1.R00.B089.B00.1506160228 06/16/2015
[ 54.522914] 0000000000000000 000000002d879fe9 ffff88016d727a58 ffffffff813d8604
[ 54.522915] ffff88016d727c50 ffff88016d727ac8 ffffffff8120272e 0000000000000015
[ 54.522916] 0000000000000000 ffff880080ab3600 ffff880086725880 ffff88016d727ab8
[ 54.522917] Call Trace:
[ 54.522921] [<ffffffff813d8604>] dump_stack+0x44/0x60
[ 54.522924] [<ffffffff8120272e>] dump_header+0x5a/0x1c5
[ 54.522926] [<ffffffff81376bd8>] ? apparmor_capable+0xb8/0x120
[ 54.522928] [<ffffffff8118b472>] oom_kill_process+0x202/0x3b0
[ 54.522929] [<ffffffff8118b885>] out_of_memory+0x215/0x460
[ 54.522931] [<ffffffff81191740>] __alloc_pages_nodemask+0x9b0/0xb40
[ 54.522933] [<ffffffff811da7cc>] alloc_pages_current+0x8c/0x110
[ 54.522934] [<ffffffff81187d75>] __page_cache_alloc+0xb5/0xc0
[ 54.522935] [<ffffffff81189f4a>] filemap_fault+0x14a/0x3f0
[ 54.522937] [<ffffffff811b6140>] __do_fault+0x50/0xe0
[ 54.522938] [<ffffffff811b9b82>] handle_mm_fault+0xf92/0x1840
[ 54.522939] [<ffffffff812526a7>] ? eventfd_ctx_read+0x67/0x210
[ 54.522941] [<ffffffff81068517>] __do_page_fault+0x197/0x400
[ 54.522942] [<ffffffff810687a2>] do_page_fault+0x22/0x30
[ 54.522944] [<ffffffff8180e2f8>] page_fault+0x28/0x30
[ 54.522945] Mem-Info:
[ 54.522947] active_anon:788399 inactive_anon:33532 isolated_anon:0
active_file:83 inactive_file:37 isolated_file:0
unevictable:1 dirty:10 writeback:0 unstable:0
slab_reclaimable:5166 slab_unreclaimable:13868
mapped:5646 shmem:9752 pagetables:4476 bounce:0
free:7576 free_pcp:0 free_cma:0
[ 54.522948] Node 0 DMA free:15476kB min:28kB low:32kB high:40kB active_anon:144kB inactive_anon:216kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15984kB managed:15888kB mlocked:0kB dirty:0kB writeback:0kB mapped:80kB shmem:80kB slab_reclaimable:0kB slab_unreclaimable:48kB kernel_stack:0kB pagetables:4kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[ 54.522951] lowmem_reserve[]: 0 2072 3862 3862
[ 54.522952] Node 0 DMA32 free:11220kB min:4204kB low:5252kB high:6304kB active_anon:1711968kB inactive_anon:80964kB active_file:236kB inactive_file:100kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2206296kB managed:2125964kB mlocked:0kB dirty:36kB writeback:0kB mapped:17948kB shmem:26240kB slab_reclaimable:8988kB slab_unreclaimable:26036kB kernel_stack:2656kB pagetables:9348kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:3776 all_unreclaimable? yes
[ 54.522955] lowmem_reserve[]: 0 0 1790 1790
[ 54.522956] Node 0 Normal free:3608kB min:3628kB low:4532kB high:5440kB active_anon:1441484kB inactive_anon:52948kB active_file:96kB inactive_file:48kB unevictable:4kB isolated(anon):0kB isolated(file):0kB present:1900544kB managed:1833172kB mlocked:4kB dirty:4kB writeback:0kB mapped:4556kB shmem:12688kB slab_reclaimable:11676kB slab_unreclaimable:29388kB kernel_stack:2448kB pagetables:8552kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:924 all_unreclaimable? yes
[ 54.522958] lowmem_reserve[]: 0 0 0 0
[ 54.522959] Node 0 DMA: 7*4kB (UME) 3*8kB (UM) 4*16kB (UME) 4*32kB (UME) 2*64kB (U) 4*128kB (UME) 1*256kB (E) 2*512kB (ME) 3*1024kB (UME) 1*2048kB (E) 2*4096kB (M) = 15476kB
[ 54.522965] Node 0 DMA32: 118*4kB (UME) 36*8kB (UME) 62*16kB (UME) 94*32kB (UME) 34*64kB (UME) 24*128kB (UME) 5*256kB (UE) 1*512kB (U) 0*1024kB 0*2048kB 0*4096kB = 11800kB
[ 54.522969] Node 0 Normal: 151*4kB (UME) 39*8kB (UME) 77*16kB (UME) 38*32kB (UME) 9*64kB (ME) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3940kB
[ 54.522974] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 54.522974] Node 0 hugepages_total=256 hugepages_free=256 hugepages_surp=0 hugepages_size=2048kB
[ 54.522975] 9932 total pagecache pages
[ 54.522976] 0 pages in swap cache
[ 54.522976] Swap cache stats: add 1831590, delete 1831590, find 5929/10969
[ 54.522977] Free swap = 0kB
[ 54.522977] Total swap = 0kB
[ 54.522978] 1030706 pages RAM
[ 54.522978] 0 pages HighMem/MovableOnly
[ 54.522979] 36950 pages reserved
[ 54.522979] 0 pages cma reserved
[ 54.522979] 0 pages hwpoisoned
[ 54.522980] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
[ 54.522986] [ 285] 0 285 10173 1022 23 3 0 0 systemd-journal
[ 54.522988] [ 312] 0 312 11192 266 22 3 0 -1000 systemd-udevd
[ 54.522989] [ 623] 100 623 25590 569 20 4 6 0 systemd-timesyn
[ 54.522990] [ 823] 0 823 5859 1723 14 3 0 0 dhclient
[ 54.522991] [ 917] 0 917 7152 96 18 3 2 0 systemd-logind
[ 54.522992] [ 936] 0 936 6310 223 16 3 0 0 smartd
[ 54.522993] [ 943] 0 943 112847 523 72 3 9 0 NetworkManager
[ 54.522993] [ 952] 0 952 84334 421 68 4 0 0 ModemManager
[ 54.522994] [ 957] 0 957 4797 40 15 4 0 0 atd
[ 54.522995] [ 961] 115 961 93456 912 80 4 0 0 whoopsie
[ 54.522996] [ 963] 0 963 4865 65 13 3 0 0 irqbalance
[ 54.522997] [ 964] 104 964 65667 224 30 4 9 0 rsyslogd
[ 54.522998] [ 966] 0 966 23282 34 13 3 0 0 lxcfs
[ 54.522999] [ 971] 105 971 10926 318 26 3 8 -900 dbus-daemon
[ 54.523000] [ 1008] 0 1008 9570 82 25 3 0 0 cgmanager
[ 54.523001] [ 1016] 0 1016 70808 240 41 3 0 0 accounts-daemon
[ 54.523002] [ 1019] 0 1019 1119 46 8 3 0 0 ondemand
[ 54.523003] [ 1022] 0 1022 7233 68 20 3 0 0 cron
[ 54.523004] [ 1028] 109 1028 11218 97 26 3 3 0 avahi-daemon
[ 54.523005] [ 1030] 0 1030 1807 20 10 3 0 0 sleep
[ 54.523006] [ 1037] 109 1037 11185 82 25 3 0 0 avahi-daemon
[ 54.523007] [ 1047] 0 1047 141966 2188 156 4 3 0 libvirtd
[ 54.523008] [ 1053] 0 1053 13902 163 33 3 0 -1000 sshd
[ 54.523009] [ 1057] 0 1057 69683 586 40 3 12 0 polkitd
[ 54.523010] [ 1072] 0 1072 10963 134 24 3 0 0 wpa_supplicant
[ 54.523011] [ 1081] 0 1081 87582 696 39 3 23 0 lightdm
[ 54.523012] [ 1088] 0 1088 99946 6138 97 3 15 0 Xorg
[ 54.523012] [ 1111] 0 1111 1099 45 8 3 0 0 acpid
[ 54.523013] [ 1125] 0 1125 56533 191 47 4 14 0 lightdm
[ 54.523014] [ 1129] 114 1129 11957 850 27 3 0 0 systemd
[ 54.523015] [ 1130] 114 1130 15825 501 33 3 0 0 (sd-pam)
[ 54.523029] [ 1136] 114 1136 30728 108 26 4 0 0 gnome-keyring-d
[ 54.523030] [ 1138] 114 1138 1119 20 8 3 0 0 lightdm-greeter
[ 54.523031] [ 1143] 114 1143 10743 145 25 3 13 0 dbus-daemon
[ 54.523032] [ 1144] 114 1144 227063 2039 170 4 17 0 unity-greeter
[ 54.523032] [ 1146] 114 1146 84488 626 34 3 0 0 at-spi-bus-laun
[ 54.523033] [ 1151] 114 1151 10680 97 27 4 0 0 dbus-daemon
[ 54.523034] [ 1153] 114 1153 51706 157 37 3 3 0 at-spi2-registr
[ 54.523035] [ 1159] 114 1159 68584 154 37 3 0 0 gvfsd
[ 54.523036] [ 1164] 114 1164 85325 145 32 3 0 0 gvfsd-fuse
[ 54.523037] [ 1174] 114 1174 44626 121 23 3 3 0 dconf-service
[ 54.523038] [ 1197] 0 1197 20665 147 44 3 0 0 lightdm
[ 54.523038] [ 1201] 114 1201 11465 160 27 3 0 0 upstart
[ 54.523039] [ 1204] 114 1204 144936 1323 136 4 4 0 nm-applet
[ 54.523040] [ 1206] 114 1206 88647 256 41 3 26 0 indicator-messa
[ 54.523041] [ 1207] 114 1207 83323 127 31 3 0 0 indicator-bluet
[ 54.523042] [ 1208] 114 1208 122044 98 37 4 12 0 indicator-power
[ 54.523043] [ 1209] 114 1209 132868 439 75 3 0 0 indicator-datet
[ 54.523044] [ 1210] 114 1210 140272 1504 127 4 1 0 indicator-keybo
[ 54.523045] [ 1211] 114 1211 134142 426 68 4 8 0 indicator-sound
[ 54.523045] [ 1212] 114 1212 189042 260 47 4 0 0 indicator-sessi
[ 54.523046] [ 1218] 114 1218 117391 350 89 4 0 0 indicator-appli
[ 54.523047] [ 1232] 0 1232 7973 81 20 3 11 0 bluetoothd
[ 54.523048] [ 1238] 114 1238 152474 1084 129 3 15 0 unity-settings-
[ 54.523049] [ 1261] 114 1261 104039 719 78 4 0 0 pulseaudio
[ 54.523050] [ 1272] 120 1272 45874 77 24 3 1 0 rtkit-daemon
[ 54.523051] [ 1293] 0 1293 68995 324 53 3 12 0 upowerd
[ 54.523052] [ 1296] 114 1296 15493 366 33 3 0 0 gconfd-2
[ 54.523053] [ 1342] 110 1342 75254 1170 49 3 0 0 colord
[ 54.523054] [ 1429] 113 1429 12484 98 27 3 0 0 dnsmasq
[ 54.523054] [ 1430] 0 1430 12477 94 27 3 0 0 dnsmasq
[ 54.523055] [ 1514] 0 1514 22408 226 49 3 0 0 sshd
[ 54.523056] [ 1570] 1000 1570 11958 853 26 3 0 0 systemd
[ 54.523057] [ 1571] 1000 1571 15825 501 33 3 0 0 (sd-pam)
[ 54.523058] [ 1631] 1000 1631 22408 244 46 3 0 0 sshd
[ 54.523058] [ 1632] 1000 1632 5779 619 16 3 0 0 bash
[ 54.523059] [ 1692] 118 1692 11320 77 25 3 14 0 kerneloops
[ 54.523060] [ 1745] 0 1745 3964 41 13 3 0 0 agetty
[ 54.523061] [ 1768] 125 1768 13192 98 27 3 0 0 dnsmasq
[ 54.523062] [ 2276] 126 2276 32160 388 58 3 0 0 exim4
[ 54.523062] [ 8310] 1000 8310 5508 661 14 3 0 0 stress-ng
[ 54.523063] [ 8311] 1000 8311 5508 49 13 3 0 0 stress-ng-brk
[ 54.523064] [ 8312] 1000 8312 5508 46 13 3 0 0 stress-ng-brk
[ 54.523065] [ 8313] 1000 8313 5508 46 13 3 0 0 stress-ng-brk
[ 54.523065] [ 8314] 1000 8314 5508 46 13 3 0 0 stress-ng-brk
[ 54.523066] [ 8321] 1000 8321 365871 360407 717 4 0 0 stress-ng-brk
[ 54.523067] [ 8322] 1000 8322 239424 233959 470 3 0 0 stress-ng-brk
[ 54.523068] [ 8323] 1000 8323 143599 138152 283 3 0 0 stress-ng-brk
[ 54.523069] [ 8324] 1000 8324 54613 49145 109 3 0 0 stress-ng-brk
[ 54.523070] Out of memory: Kill process 8321 (stress-ng-brk) score 363 or sacrifice child
[ 54.523072] Killed process 8321 (stress-ng-brk) total-vm:1463484kB, anon-rss:1441628kB, file-rss:0kB
However, this message may have been cleared from the kernel ring buffer by the time you look, so you may need to inspect the saved kernel logs in /var/log/kern.log* instead.
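For example, assuming a Debian/Ubuntu-style layout where older kernel logs are rotated and gzip-compressed, and a systemd journal if your system keeps one, either of the following should find the relevant entries:
zgrep -i "killed process" /var/log/kern.log*
journalctl -k | grep -i "killed process"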
The default virtual memory setting for Linux is to over-commit memory. This means the kernel will allow processes to allocate more memory than is actually available, which lets them memory-map large regions because normally not all the pages in an allocation are used. However, sometimes a process reads/writes all of the over-committed pages and the kernel cannot back them with physical memory plus swap, so the OOM killer attempts to find the best candidate over-committed process and kill it.
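You can check the current overcommit policy and see how likely a particular process is to be picked by the OOM killer through the standard sysctl and procfs interfaces; the commands below are just an illustration (replace <pid> with a real process ID):
# 0 = heuristic overcommit (the default), 1 = always overcommit, 2 = strict accounting
sysctl vm.overcommit_memory
# a higher oom_score means a more likely kill candidate
cat /proc/<pid>/oom_score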
So, if you want to see the kernel log immediately after the job is killed, wrap it with the following bash script:
#!/bin/bash
# Run the job; replace "your_job_here" with the actual command to run.
your_job_here
ret=$?
#
# Exit statuses greater than 127 mean the process was terminated by a signal
# (status = 128 + signal number).
#
if [ $ret -gt 127 ]; then
    sig=$((ret - 128))
    echo "Got SIGNAL $sig"
    if [ $sig -eq $(kill -l SIGKILL) ]; then
        echo "process was killed with SIGKILL"
        # Capture the kernel ring buffer while the OOM messages are still there.
        dmesg > "$HOME/dmesg-kill.log"
    fi
fi
Note: "your_job_here" is the name of the program/job you want to run. This script checks the return code of the program and will check if it was killed with a SIGKILL and if so, will dump the dmesg immediately afterwards to your home directory in a file called dmesg-kill.log
Hope that helps