Determining cause of Linux kernel panic
I have two suggestions to start.
The first you're not going to like. No matter how stable you think your overclocked system is, it would be my first suspect, and any developer you report the problem to will say the same thing. Your stability test workload isn't necessarily using the same instructions or stressing the memory subsystem as much as whatever triggers the crash. Stop overclocking. If you want people to believe the problem isn't overclocking, then make it happen when not overclocking so you can file a clean bug report. This will make a huge difference in how much effort other people will invest in solving this problem. Having bug-free software is a point of pride, but reports from people with particularly questionable hardware setups are frustrating time-sinks that probably don't involve a real bug at all.
The second is to get the oops data, which as you've noticed doesn't go to any of the places you've mentioned. If the crash only happens while running X11, I think local console is pretty much out (it's a pain anyway), so you need to do this over a serial console, over the network, or by saving to local disk (which is trickier than it may sound because you don't want an untrustworthy kernel to corrupt your filesystem). Here are some ways to do this:
- use netdump to save to a server over the network. I haven't done this in years, so I'm not sure this software is still around and working with modern kernels, but it's easy enough that it's worth a shot.
- boot using a serial console; you'll need a serial port free on both machines (whether an old-school one or a USB serial adapter) and a null modem cable; you'd configure the other machine to save the output.
- kdump seems to be what the cool kids use nowadays, and seems quite flexible, although it wouldn't be my preference because it looks complex to set up. In short, it involves booting a second kernel that can inspect the crashed kernel's memory contents and do anything it likes with them, but you essentially have to build the whole process yourself and I don't see a lot of canned options out there. Update: There are some nice distro packages, actually; on Ubuntu, there is linux-crashdump.
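For the serial console option, a minimal sketch of the kernel command line might look like the fragment below. It assumes the first serial port (`ttyS0`) and a speed of 115200 baud; both are assumptions, so adjust them to your hardware. Listing `console=tty0` as well keeps messages on the local screen.

```text
console=tty0 console=ttyS0,115200n8
```

On the receiving machine, any terminal program set to the same speed and logging to a file should do, for example `screen -L /dev/ttyUSB0 115200` (the device name here is hypothetical and depends on your adapter).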
Once you get the debug info, there's a tool called ksymoops that you can use to turn the addresses into symbol names and start getting an idea how your kernel crashed. And if the symbolized dump doesn't mean anything to you, at least this is something helpful to report here or perhaps on your Linux distribution's mailing list / bug tracker.
From `crash` on your crashdump, you can try typing `log` and `bt` to get a bit more information (things logged during the panic and a stack backtrace). Your `Fatal Machine check` seems to be coming from here, though. From skimming the code, your processor has reported a Machine Check Exception - a hardware problem. Again, my first bet would be overclocking. It seems like there might be a more specific message in the `log` output which could tell you more.
Also from that code, it looks like if you boot with the `mce=3` kernel parameter, it will stop crashing... but I wouldn't really recommend this except as a diagnostic step. If the Linux kernel thinks this error is worth crashing over, it's probably right.
a) Check whether kernel messages are being logged to a file by the rsyslog daemon
vi /etc/rsyslog.conf
And add the following line if it is not already there
kern.* /var/log/kernel.log
Restart the rsyslog service.
/etc/init.d/rsyslog restart
b) Take a note of the loaded modules by saving the list to a file (the file name is only an example)
`lsmod > /your/home/dir/lsmod.txt`
c) As the panic is not reproducible, wait for it to happen
d) Once the panic has occurred, boot the system using a live or emergency CD
e) Mount the filesystems of the affected system (usually / will suffice if /var and /home are not separate file systems). If the affected system uses LVM, run `pvs`, `vgs`, and `lvs` to find the logical volume, and `vgchange -ay` if the volume group has not been activated automatically.
mount -t ext4 /dev/sdXN /mnt
f) Go to the `/mnt/var/log/` directory and check the `kernel.log` file. This should give you enough information to figure out if the panic is happening for a particular module or something else.
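To make step f) quicker, here is a small sketch of a helper that pulls out the lines that usually accompany a crash. The function name `scan_kernel_log` is hypothetical, and the pattern list is my assumption about which messages are worth looking at first, not an exhaustive set of what the kernel can print.

```shell
#!/bin/sh
# Hypothetical helper: print lines of a kernel log that commonly
# accompany a panic, each prefixed with its line number.
# The pattern list is an assumption and is not exhaustive.
scan_kernel_log() {
    log_file=$1
    if [ ! -r "$log_file" ]; then
        echo "cannot read $log_file" >&2
        return 2
    fi
    # -n prints line numbers, -E enables alternation with |.
    # Matching is case-sensitive so that "debug:" does not match "BUG:".
    grep -n -E 'Kernel panic|Oops|BUG:|Call Trace|[Mm]achine [Cc]heck|Hardware Error' "$log_file"
}
```

With the affected root filesystem mounted on /mnt as above, you would run `scan_kernel_log /mnt/var/log/kernel.log` and then read the surrounding lines in the file for context. If nothing is printed, the panic most likely never reached the disk, which is the case the serial console and kdump approaches are meant for.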
Is your processor overclocked? I had this same issue today when I was playing with the multiplier in the overclocking menu in my BIOS; various multipliers around 20x would cause this to happen. I reduced it to 18.5x (3.7 GHz) and the problem went away; I think it was a motherboard/power issue.