Why does storage performance change at various queue depths?
Solution 1:
The short answer is that hard drives optimize the retrieval of data when there is more than one IO request outstanding, which usually increases throughput at the expense of latency.
NCQ (Native Command Queuing) does exactly this: it reorders IO requests to optimize throughput.
SSDs work differently from mechanical drives, since they store data on parallel flash chips. That is, if you issue one IO request at a time, the latency (lookup + read time) determines IOPS. But if you issue 4 requests at once, the SSD might be able to retrieve them in parallel or in some other optimized way, and you might get 4 times the throughput.
The higher the queue depth gets, the more opportunities the disk has to optimize, up to a point. Since IOPS is a function of throughput, this increases IOPS at higher queue depths.
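If you want to see the effect yourself, here is a rough Python sketch that keeps N worker threads busy with random 4 KiB reads and reports IOPS per queue depth. The file path, block size, and run time are placeholders, and without O_DIRECT the page cache will flatter the numbers, so treat it only as an illustration; a proper benchmark tool like fio does this correctly.

```python
# Rough sketch: approximate random-read IOPS at several queue depths by keeping
# N worker threads busy with 4 KiB reads at random offsets. The file path,
# block size, and duration are assumptions for illustration only; a real
# benchmark (e.g. fio) would also use O_DIRECT to bypass the page cache.
import os
import random
import time
from concurrent.futures import ThreadPoolExecutor

TEST_FILE = "/path/to/large_test_file"   # assumption: any multi-GB file on the SSD
BLOCK = 4096                             # 4 KiB per read
DURATION = 5.0                           # seconds per queue depth

def worker(fd: int, size: int, deadline: float) -> int:
    """Issue random 4 KiB reads until the deadline; return how many completed."""
    done = 0
    while time.monotonic() < deadline:
        offset = random.randrange(0, size - BLOCK) // BLOCK * BLOCK
        os.pread(fd, BLOCK, offset)
        done += 1
    return done

def measure(queue_depth: int) -> float:
    """Run `queue_depth` workers concurrently and return aggregate IOPS."""
    fd = os.open(TEST_FILE, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        deadline = time.monotonic() + DURATION
        with ThreadPoolExecutor(max_workers=queue_depth) as pool:
            totals = pool.map(worker, [fd] * queue_depth,
                              [size] * queue_depth, [deadline] * queue_depth)
        return sum(totals) / DURATION
    finally:
        os.close(fd)

if __name__ == "__main__":
    for qd in (1, 2, 4, 8, 16, 32):
        print(f"QD{qd:>2}: {measure(qd):,.0f} IOPS")
```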
EDIT:
The true queue resides in the OS, which is issuing all the requests. That said, I would conjecture that the controller driver passes a certain portion of the queue on to the controller and the disk so they can work at their optimal queue depths. The disk must have its own queue to be able to optimize.
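On Linux you can actually look at both of the queues this conjecture refers to. A small sketch (the device name is an assumption; `device/queue_depth` is present for SATA/SCSI devices, while NVMe exposes its queues differently):

```python
# Sketch, Linux-specific: peek at the queue sizes involved.
# queue/nr_requests is the block-layer (OS) queue size; device/queue_depth is
# what the driver advertises to the device (typically 31-32 with SATA NCQ).
# "sda" is an assumed device name - adjust to your system.
from pathlib import Path

dev = "sda"  # assumption
for knob in (f"/sys/block/{dev}/queue/nr_requests",
             f"/sys/block/{dev}/device/queue_depth"):
    p = Path(knob)
    print(knob, "=", p.read_text().strip() if p.exists() else "(not present)")
```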
Solution 2:
Old question, but deserves more info due to the number of times it gets seen. This answer is based on SSDs, since that's what the original question is about.
First, queue depth and why IOPS varies with it. Consider a queue depth of 1. In this case, the spec is not describing a drive falling behind the requests the system is generating; it means the system sends the SSD one request at a time. Each request has a transaction time, so if only one request is sent at a time, the SSD can only process that one request. This should make sense. Because of the transaction time, sending one request at a time is slower than keeping 32 requests outstanding at once, which is what a spec like QD32 means.
Also, as another comment pointed out, you may be able to do some of the reading/writing in parallel with an SSD on PCIe, but not on SATA.
In the case of the Samsung 970 Pro, for instance, QD1 = 55,000 IOPS and QD32 = 500,000 IOPS. The difference is basically one request in flight versus 32 at once. With 32 outstanding requests, the per-request transaction overhead is amortized, so more of the time goes into actually transmitting data. Subtract a lot of transaction processing, and the rate of useful data transmission goes up.
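A quick back-of-the-envelope with Little's law (IOPS ≈ outstanding requests / per-request latency) shows what those spec numbers imply. The latencies below are derived from the quoted figures, not measured:

```python
# Back-of-the-envelope check using Little's law: IOPS ~= queue_depth / latency.
# The IOPS figures are the quoted 970 Pro spec values; the latencies are only
# what those figures imply, not measurements.
specs = {1: 55_000, 32: 500_000}  # queue depth -> IOPS

for qd, iops in specs.items():
    implied_latency_us = qd / iops * 1e6
    print(f"QD{qd:>2}: {iops:>7,} IOPS -> ~{implied_latency_us:.0f} us per request in flight")

# QD1 implies ~18 us per request; QD32 implies ~64 us per request, but with
# 32 requests overlapping, the drive completes roughly 9x more work per second.
```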
So, the spec given for the disks is not exactly the same as the definition of queue depth. Queue depth with regard to the system is basically the number of requests that haven't been processed yet, while the spec is based on how many transactions the system sends the drive at a time. However, if you're dealing with a SAN, queue depth is basically the number of requests in flight. So I'm not sure there is one exact definition for the term; it seems to vary depending on which part of the system you're referring to.
As for the transactions between the OS and the device: the device will buffer a certain number of transactions, and after that the OS won't send more. There HAS to be some form of handshaking that allows for orderly processing, which means the OS can't send more requests to a drive than the drive can physically hold. Otherwise you have chaos and a poorly designed system.
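On Linux you can watch this directly: the kernel exposes how many requests are currently issued to the device and not yet completed, which is effectively the in-flight queue depth at that instant. A small sketch (the device name is an assumption):

```python
# Sketch, Linux-specific: sample how many requests are outstanding right now.
# /sys/block/<dev>/inflight reports two counters (reads, writes) that have
# been issued to the device but not yet completed.
# "nvme0n1" is an assumed device name - adjust to your system.
import time
from pathlib import Path

dev = "nvme0n1"  # assumption
inflight = Path(f"/sys/block/{dev}/inflight")

for _ in range(5):
    reads, writes = inflight.read_text().split()
    print(f"in flight: {reads} reads, {writes} writes")
    time.sleep(1)
```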
In other words, a question like "what happens when you get an incoming request but the queue depth is full" should never arise at the disk. The disk doesn't have a "queue depth" to hold requests; it has a queue. The physical number of requests a drive will hold varies by drive type. It can't be too small, or the disk can't optimize its reads/writes very well, and it can't be too large, for many reasons: cost is one, and the inability to optimize beyond a certain number of queued requests is probably another.