Suddenly slow RAID performance
As Michal pointed out in his comment, the issue was a "prefailing" disk. There were no red flags in the diagnostics from the megaraid controller, and smartctl reported "SMART Health Status: OK", but running smartctl on each disk revealed a huge Non-medium error count on one of them (I wrote a quick bash script to loop through each disk ID). Here are the relevant bits from the full output:
# Ran this for each individual disk on the /dev/sdb array:
smartctl -a -d megaraid,18 /dev/sdb
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:    7950078        0         0   7950078    7950078        660.801           0
write:         0        0         0         0          0        363.247           0
verify:       12        0         0        12         12          0.002           0
Non-medium error count: 3253718
Every other drive showed a non-medium error count of 0; only this one (disk ID 18) did not. I identified the disk, swapped it for a new one, and am back to getting 3gbps reads.
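For anyone who wants to repeat the check, here is a minimal sketch of the kind of loop I mean. It is not the exact script I ran: the function names, the DEVICE default, and the ID range in the usage comment are placeholders, so adjust them to your controller.

```shell
#!/usr/bin/env bash
# Sketch: report the non-medium error count for each disk behind a MegaRAID controller.
# DEVICE and the disk IDs passed to scan_disks are assumptions; adjust for your array.

DEVICE="${DEVICE:-/dev/sdb}"

# Read smartctl output on stdin and print the number from the
# "Non-medium error count: N" line (prints nothing if the line is absent).
non_medium_count() {
    awk '/^Non-medium error count:/ { print $NF }'
}

# Query every disk ID given as an argument and print one line per disk.
scan_disks() {
    local id count
    for id in "$@"; do
        count=$(smartctl -a -d "megaraid,${id}" "$DEVICE" 2>/dev/null | non_medium_count)
        printf 'disk %s: %s\n' "$id" "${count:-n/a}"
    done
}

# Example call (hypothetical ID range, usually needs root):
#   scan_disks $(seq 0 23)
```

A healthy disk should print 0; anything that stands out from its neighbours the way disk 18 did is worth a closer look.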
According to smartmontools wiki:
The displayed error logs (if available) are displayed on separate lines:
write error counters
read error counters
verify error counters (only displayed if non-zero)
non-medium error counter (only a single number displayed). This represents the number of recoverable events other than write, read or verify errors.
error events are held in the "Last n error events" log page. The number of error event records held (i.e. "n") is vendor specific (e.g. up to 23 records are held for Hitachi 10K300 model disks). The contents of each error event record is in ASCII and vendor specific. The parameter code associated with each error event record indicates the relative time at which the error event occurred. A higher parameter code indicates that the error event occurred later in time. If this log page is not supported by the device then "Error Events logging not supported" is output. If this log page is supported and there are error event records then each one is prefixed by "Error event <n>:" where <n> is the parameter code.
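If you save the smartctl output for several disks, the counters described above can be condensed to one line per row for easier comparison. This is only a sketch: summarize_error_log is a name I made up, and the column positions assume the seven-column layout shown in my output above, which may differ between smartctl versions.

```shell
# Sketch: condense the SCSI error counter log from smartctl output on stdin.
# Assumes rows look like "read: <7 numeric columns>" as in the output above.
summarize_error_log() {
    awk '
        /^(read|write|verify):/ {
            # $1 is the row label, $5 is "Total errors corrected",
            # $8 is "Total uncorrected errors".
            sub(/:$/, "", $1)
            printf "%s corrected=%s uncorrected=%s\n", $1, $5, $8
        }
        /^Non-medium error count:/ { printf "non-medium=%s\n", $NF }
    '
}

# Example call (usually needs root):
#   smartctl -a -d megaraid,18 /dev/sdb | summarize_error_log
```

Note that the verify row only appears when it is non-zero, so the summary may contain two or three counter lines.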