What does "The IO operation at logical block address # for Disk # was retried." mean when seen in the Windows Server System event log?

Solution 1:

No, it does not mean that data was lost. It simply means that the IRP (IO Request Packet) timed out while the IO System waited for it to complete, and so it was tried again. When a thread begins any IO operation, the IO manager creates an IRP to represent the operation as it passes through the system.

The IRP gets stored in its initial state in a buffer/look-aside list, so that it can be retried if it fails the first time. That provides the atomicity that one would expect from any transactional system, so you can be more confident that a bunch of corrupted or incomplete data is not going to be written to your disk.

This event makes perfect sense in the event of an MPIO failure. Say Windows goes to read or write something from SAN storage. The request is dispatched, and at the same instant, I cut one of the cables to the SAN. That request is never going to complete, and so Windows will try the request again, only this time the request will follow the other path.
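The retry-on-another-path idea can be sketched as a toy simulation. This is only an illustration of the concept, not how Windows or MPIO is implemented; the names `send_request`, `submit_with_failover`, `path_a` and `path_b` are hypothetical.

```python
class PathDown(Exception):
    """Raised when a simulated path cannot complete the request."""


def send_request(path, healthy_paths):
    # Hypothetical stand-in for dispatching one I/O request down one path.
    if path not in healthy_paths:
        raise PathDown(path)
    return f"completed via {path}"


def submit_with_failover(paths, healthy_paths):
    """Try each path in order; reissue the same request on the next path on failure."""
    attempts = []
    for path in paths:
        attempts.append(path)
        try:
            return send_request(path, healthy_paths), attempts
        except PathDown:
            continue  # the request did not complete, so it is retried elsewhere
    raise RuntimeError("all paths failed")


# Simulate the cut cable: path_a is down, path_b is still healthy.
result, attempts = submit_with_failover(["path_a", "path_b"], healthy_paths={"path_b"})
print(result, attempts)
```

The point of the sketch is that the caller only sees one successful completion; the failed first attempt shows up solely as a retry, which is what the event reports.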

These events also occur when the disks are overburdened or just really slow. You might notice these messages coincide with scheduled backups, etc. The disk might just be slow and busy, and some random IRP timed out and had to try again. The IRP could be getting stuck in an interrupt service routine, or a deferred procedure call, or whatever.

I could see having a lot of IO filter drivers in your stack exacerbating this issue as well.

This behavior is not new; previous versions of Windows retried in just the same way. It's just that Microsoft apparently decided to surface these events in Win8/Server 2012.

Edit: You can find the outstanding IRPs of a thread with a kernel debugger. Start with kd> !process 0 0 to list all the running processes. Then issue kd> !process 8f7d6c4a with the address of the process you are interested in, which will list all the IRPs associated with the threads of that process. Finally, inspect one of those IRPs with kd> !irp 1a2b3c4d, using the IRP address from the previous step.

Once you list the information about an IRP using the !irp command, you can easily spot which driver last handled the IRP because it will have a > pointing to it in the list. Then to get more information about what that driver was doing with that IRP, do a kd> !devobj 1a2b3c4d5e6f where that is the actual address of the device object.

Then do a kd> dt 0x1a2b3c3c2b1a _CLASS_PRIVATE_FDO_DATA using the address of the PrivateFdoData structure you got.

Now you're ready to dump the AllTransferPacketsList data structure you got from PrivateFdoData.

The idea is, you're tracking down what driver was doing what with the IRP the last time it was seen. If the IRP is AWOL for too long, it's timed out and retried from the beginning. This can be caused by so many things... even a stray cosmic ray. But the important thing is that the transaction will be retried from the beginning, and it will not be considered complete until the IO manager says it is.

Oh, and there's also thread-agnostic IO which is a completely different can of worms. :)

For further reading on this topic, I highly recommend chapter 8, I/O System, of Windows Internals 6th edition, by Mark Russinovich, David Solomon, and Alex Ionescu.

Edit: I did finally find the official KB for this error: http://support.microsoft.com/kb/2819485/EN-US

The IO operation should be retried 8 times, once per minute, until Windows gives up.

Edit: As promised: https://docs.microsoft.com/en-us/archive/blogs/ntdebugging/interpreting-event-153-errors

Solution 2:

No, there would be a different message, and (hopefully) one of the application layers would throw an exception if it failed to successfully save data.

Prior to Windows Server 2012 (or hotfix 2819485 if on Windows Server 2008 R2), the system would silently retry when these timeouts occurred. The purpose of the message is to increase visibility of these occurrences. They may indicate a capacity issue or driver defect, and in the case of iSCSI, other operating system defects may contribute to the delay.

In the case of external (not direct-attached) storage, some vendors in the past have increased the timeout value, for example to 60 seconds. However, given the default number of retries by higher layer components such as the iSCSI initiator, this could mean that several minutes may elapse before the system initiated a failover. That would obviously be suboptimal behavior.
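A rough back-of-the-envelope sketch of why a long timeout hurts: if every attempt has to time out before the next one starts, the worst-case delay grows as the timeout multiplied by the number of attempts. The attempt counts below are illustrative assumptions, not documented defaults for any particular initiator or driver, and the function name is hypothetical.

```python
def worst_case_delay_seconds(timeout_seconds, attempts):
    """Upper bound on elapsed time if every attempt runs to its full timeout."""
    return timeout_seconds * attempts


# Assumed values for illustration only: a 60-second timeout and 8 attempts.
delay = worst_case_delay_seconds(timeout_seconds=60, attempts=8)
print(f"{delay} s = {delay / 60:.0f} min")

# The same number of attempts with a 10-second timeout.
shorter = worst_case_delay_seconds(timeout_seconds=10, attempts=8)
print(f"{shorter} s")
```

Under these assumed numbers, shortening the timeout from 60 to 10 seconds cuts the worst case from 8 minutes to 80 seconds, which is the argument for keeping the value low.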

More information:

Registry Entries for SCSI Miniport Drivers
http://msdn.microsoft.com/en-us/library/windows/hardware/ff563970%28v=vs.85%29.aspx

https://docs.microsoft.com/en-us/archive/blogs/san/the-windows-disk-timeout-value-less-is-better


Microsoft has released an update that provides the capability to specify the threshold for storport.sys operations.

After you install this update, you can log an event when the latency time for I/O to storage is equal to, or greater than, a threshold. The threshold value can be set by the user. This operation is performed at the Adapter Driver level so that you can see if there is a performance issue on the SAN. Then, you can contact a storage vendor to address the issue.

Note: This update restores the functionality that was provided in Windows 7 and Windows Server 2008 R2. When the functionality is enabled, the threshold value is measured in units of 100 nanoseconds (0.0001 milliseconds). Additionally, the following values are logged in the event:

BuildIoDuration: Length of time that the MINIPORT has spent in the build I/O function for this request

StartIoDuration: Length of time that the MINIPORT has spent in the start I/O function for this request

DataTransferLength: Size of the transfer in bytes
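Because the threshold is expressed in units of 100 nanoseconds, it is easy to be off by a few orders of magnitude when thinking in milliseconds. A minimal conversion sketch follows; the helper names are hypothetical, and how the value is actually configured is described in the KB articles below, not here.

```python
UNITS_PER_MS = 10_000  # 1 ms = 1,000,000 ns = 10,000 units of 100 ns


def ms_to_100ns_units(milliseconds):
    """Convert a latency in milliseconds to 100-nanosecond units."""
    return int(round(milliseconds * UNITS_PER_MS))


def units_to_ms(units):
    """Convert a value in 100-nanosecond units back to milliseconds."""
    return units / UNITS_PER_MS


# Example: a 50 ms latency threshold expressed in 100 ns units.
print(ms_to_100ns_units(50))
print(units_to_ms(500_000))
```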

Update that improves the logging capabilities of the Storport.sys driver in Windows Server 2012
http://support.microsoft.com/kb/2819476

Windows 8 and Windows Server 2012 cumulative update: April 2013
http://support.microsoft.com/kb/2822241


Solution 3:

Might be a late post, but I have found that it can be caused by VSS. We had a client who was running Veeam but had forgotten to turn off Windows Server Backup (the disk had been removed). It caused a shedload of problems, and this error was the main one.

Stopped the backup and wham, no errors.