Suddenly slow RAID performance

As Michal pointed out in his comment, the issue was a "prefailing" disk. The megaraid controller's diagnostics raised no red flags and smartctl reported SMART Health Status: OK, but running smartctl against each disk individually revealed a huge non-medium error count on one of them (I wrote a quick bash script to loop through each disk ID). Here are the relevant bits from the full output:

# Ran this for each individual disk on the /dev/sdb array:
smartctl -a -d megaraid,18  /dev/sdb

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:    7950078        0         0   7950078    7950078        660.801           0
write:         0        0         0         0          0        363.247           0
verify:       12        0         0        12         12          0.002           0

Non-medium error count:  3253718

Every other drive showed a non-medium error count of 0; only this one (disk ID 18) was nonzero. I identified the disk, swapped it for a new one, and am back to getting 3gbps reads.
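For anyone wanting to repeat the check, a loop over the disk IDs might look roughly like this. It is a minimal sketch, not the original script: the function names `non_medium_count` and `scan_disks` are hypothetical, and the ID range passed in the usage line (0 to 31) is an assumption that depends on how your controller numbers its disks. Only the `smartctl -a -d megaraid,N` invocation is taken from the output above.

```bash
#!/usr/bin/env bash
# Sketch: report the "Non-medium error count" for each disk behind a
# MegaRAID controller. Function names and the ID range are assumptions.

# Read `smartctl -a` output on stdin and print the non-medium error count.
# Prints nothing if the line is absent.
non_medium_count() {
    awk '/^Non-medium error count/ { print $NF }'
}

# scan_disks DEVICE FIRST_ID LAST_ID
# Queries every megaraid disk ID in the range and prints one line per disk
# that reports a count. IDs with no disk behind them are skipped silently.
scan_disks() {
    local device=$1 first=$2 last=$3 id count
    for ((id = first; id <= last; id++)); do
        count=$(smartctl -a -d "megaraid,${id}" "$device" 2>/dev/null | non_medium_count)
        if [[ -n $count ]]; then
            printf 'disk %d: non-medium error count = %s\n' "$id" "$count"
        fi
    done
    return 0
}

# Usage (typically needs root):
#   scan_disks /dev/sdb 0 31
```

A healthy array should print 0 for every disk, so any single disk with a large count stands out immediately.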

According to the smartmontools wiki:

The displayed error logs (if available) are displayed on separate lines:

write error counters

read error counters

verify error counters (only displayed if non-zero)

non-medium error counter (only a single number displayed). This represents the number of recoverable events other than write, read or verify errors.

error events are held in the "Last n error events" log page. The number of error event records held (i.e. "n") is vendor specific (e.g. up to 23 records are held for Hitachi 10K300 model disks). The contents of each error event record is in ASCII and vendor specific. The parameter code associated with each error event record indicates the relative time at which the error event occurred. A higher parameter code indicates that the error event occurred later in time. If this log page is not supported by the device then "Error Events logging not supported" is output. If this log page is supported and there are error event records then each one is prefixed by "Error event <n>:" where <n> is the parameter code.

