Fiber multipath fails: Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
This is an HP ProLiant DL380 Gen9 server. Pretty standard enterprise-class server.
Can you give me information on the server's firmware revision?
Is EMC PowerPath actually installed? If so, check here.
Do you have the HP Management Agents installed? If so, can you post the output of hplog -v?
Have you seen anything in the iLO 4 log? Is the iLO accessible?
Can you describe all of the PCIe cards installed in the system's slots?
For RHEL6-specific tuning, I highly recommend XFS, running tuned-adm profile enterprise-storage, and ensuring your filesystems are mounted nobarrier (the tuned profile should handle that).
For the volumes, please ensure that you're using the dm (multipath) devices instead of /dev/sdX. See: https://access.redhat.com/solutions/1212233
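A quick way to check the dm-versus-/dev/sdX point is to scan the mount table for anything mounted straight from an sd device. This is only a rough sketch: the function name and the sample mount table are made up for illustration, and it will also flag local disks that legitimately show up as /dev/sdX, so filter those out yourself.

```python
import re

# Whole-disk or partition names such as /dev/sdc or /dev/sdc1.
RAW_SD_RE = re.compile(r"^/dev/sd[a-z]+\d*$")


def raw_sd_mounts(mounts_text):
    """Return (device, mountpoint) pairs mounted from /dev/sdX, not a dm device."""
    flagged = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts format: device, mountpoint, fstype, options, dump, pass
        if len(fields) >= 2 and RAW_SD_RE.match(fields[0]):
            flagged.append((fields[0], fields[1]))
    return flagged


# Hypothetical sample in /proc/mounts format (not taken from the poster's system).
SAMPLE_MOUNTS = """\
/dev/mapper/mpatha /data xfs rw,noatime 0 0
/dev/sdc1 /backup xfs rw 0 0
/dev/sda1 /boot ext4 rw 0 0
"""

if __name__ == "__main__":
    for device, mountpoint in raw_sd_mounts(SAMPLE_MOUNTS):
        print(f"{mountpoint} is mounted from {device}, not a multipath device")
```

On a live box you would pass in the contents of /proc/mounts instead of the sample string.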
Looking at what you've presented so far and the check listed at Red Hat's support site (and the description here), I can't rule out an HBA failure or PCIe riser problems. There's also a slight possibility of an issue on the VMAX side.
Can you swap PCIe slots and try again? Can you swap cards and try again?
Is the firmware on the HBA current? Here's the most recent package from December 2016.
Firmware 6.07.02 BIOS 3.21
A DID_ERROR typically indicates the driver software detected some type of hardware error via an anomaly within the returned data from the HBA.
A hardware or SAN-based issue is present within the storage subsystem such that received Fibre Channel response frames contain invalid or conflicting information that the driver is not able to use or reconcile.
Please review the system's hardware, switch error counters, etc. to see if there is any indication of where the issue might lie. The most likely candidate is the HBA itself.
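To narrow down where the issue might lie, it can help to tally the DID_ERROR results per SCSI host and device, since errors that cluster on one host number point at one HBA port or path. The sketch below assumes the kernel log lines look like the Result: line in the question title, prefixed with the usual sd H:C:T:L: [sdX] tag. The sample lines and the function name are invented for illustration, so adjust the pattern to whatever your /var/log/messages actually contains.

```python
import re
from collections import Counter

# Assumed line shape: "sd 1:0:0:3: [sdc] Result: hostbyte=... driverbyte=..."
RESULT_RE = re.compile(
    r"sd (?P<hctl>\d+:\d+:\d+:\d+): \[(?P<dev>sd[a-z]+)\] .*?"
    r"hostbyte=(?P<host>\w+) driverbyte=(?P<driver>\w+)"
)


def count_host_errors(lines, hostbyte="DID_ERROR"):
    """Count lines with the given hostbyte, keyed by (SCSI host number, device)."""
    counts = Counter()
    for line in lines:
        match = RESULT_RE.search(line)
        if match and match.group("host") == hostbyte:
            scsi_host = match.group("hctl").split(":")[0]
            counts[(scsi_host, match.group("dev"))] += 1
    return counts


# Hypothetical sample lines, not real output from the poster's server.
SAMPLE_LOG = [
    "Jan 10 12:00:01 host1 kernel: sd 1:0:0:3: [sdc] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK",
    "Jan 10 12:00:02 host1 kernel: sd 1:0:0:3: [sdc] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK",
    "Jan 10 12:00:03 host1 kernel: sd 2:0:0:3: [sdf] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE",
    "Jan 10 12:00:04 host1 kernel: sd 1:0:1:4: [sde] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK",
]

if __name__ == "__main__":
    for (scsi_host, dev), n in sorted(count_host_errors(SAMPLE_LOG).items()):
        print(f"host{scsi_host} {dev}: {n} DID_ERROR result(s)")
```

If every hit lands on the same host number, that is a hint to start with that HBA port, its SFP, and its cable before looking further out.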
This looks to me like one of your SFPs has soft-failed... Look in your storage switch for errors on the port while you are doing a large copy.
I had a similar issue recently where everything looked great. The server vendor signed off on their stuff, the storage vendor said theirs looked good and swore the SFPs were all fine... The SFP still showed as up and functional until large amounts of data were sent across the MPIO interface, at which point lots of errors started getting logged on the storage switch port.
I had to replace all fiber cables with new ones, then switch SFPs with spares I had on hand to prove to the vendor that the SFP was bad, even though it appeared fine otherwise.
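If you want to make the "watch the switch port during a large copy" test a bit more rigorous, snapshot the port's error counters before and after the copy and look only at what moved. Here's a minimal sketch of that comparison. The counter names and numbers are placeholders I made up, not any particular switch vendor's output, so map them to whatever your switch reports.

```python
def counter_deltas(before, after):
    """Return the counters that increased between two snapshots, with the increase."""
    return {
        name: after[name] - before.get(name, 0)
        for name in after
        if after[name] > before.get(name, 0)
    }


# Hypothetical snapshots of one switch port, taken before and after a large copy.
before = {"crc_err": 12, "enc_out": 340, "link_fail": 0}
after = {"crc_err": 4812, "enc_out": 9911, "link_fail": 0}

if __name__ == "__main__":
    for name, delta in sorted(counter_deltas(before, after).items()):
        print(f"{name} increased by {delta} during the copy")
```

Counters that stay flat while idle and climb only under load fit the soft-failed SFP or bad cable pattern described above.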