Is non-blocking I/O really faster than multi-threaded blocking I/O? How?

The biggest advantage of nonblocking or asynchronous I/O is that your thread can continue its work in parallel. Of course you can achieve this also using an additional thread. As you stated for best overall (system) performance I guess it would be better to use asynchronous I/O and not multiple threads (so reducing thread switching).

Let's look at possible implementations of a network server program that shall handle 1000 clients connected in parallel:

  1. One thread per connection (can be blocking I/O, but can also be non-blocking I/O).
    Each thread requires memory resources (also kernel memory!), that is a disadvantage. And every additional thread means more work for the scheduler.
  2. One thread for all connections.
    This takes load from the system because we have fewer threads. But it also prevents you from using the full performance of your machine, because you might end up driving one processor to 100% and letting all other processors idle around.
  3. A few threads where each thread handles some of the connections.
    This takes load from the system because there are fewer threads. And it can use all available processors. On Windows this approach is supported by Thread Pool API.

Of course having more threads is not per se a problem. As you might have recognized I chose quite a high number of connections/threads. I doubt that you'll see any difference between the three possible implementations if we are talking about only a dozen threads (this is also what Raymond Chen suggests on the MSDN blog post Does Windows have a limit of 2000 threads per process?).

On Windows using unbuffered file I/O means that writes must be of a size which is a multiple of the page size. I have not tested it, but it sounds like this could also affect write performance positively for buffered synchronous and asynchronous writes.

The steps 1 to 7 you describe give a good idea of how it works. On Windows the operating system will inform you about completion of an asynchronous I/O (WriteFile with OVERLAPPED structure) using an event or a callback. Callback functions will only be called for example when your code calls WaitForMultipleObjectsEx with bAlertable set to true.

Some more reading on the web:

  • Multiple Threads in the User Interface on MSDN, also shortly handling the cost of creating threads
  • Section Threads and Thread Pools says "Although threads are relatively easy to create and use, the operating system allocates a significant amount of time and other resources to manage them."
  • CreateThread documentation on MSDN says "However, your application will have better performance if you create one thread per processor and build queues of requests for which the application maintains the context information.".
  • Old article Why Too Many Threads Hurts Performance, and What to do About It

I/O includes multiple kind of operations like reading and writing data from hard drives, accessing network resources, calling web services or retrieving data from databases. Depending on the platform and on the kind of operation, asynchronous I/O will usually take advantage of any hardware or low level system support for performing the operation. This means that it will be performed with as little impact as possible on the CPU.

At application level, asynchronous I/O prevents threads from having to wait for I/O operations to complete. As soon as an asynchronous I/O operation is started, it releases the thread on which it was launched and a callback is registered. When the operation completes, the callback is queued for execution on the first available thread.

If the I/O operation is executed synchronously, it keeps its running thread doing nothing until the operation completes. The runtime doesn't know when the I/O operation completes, so it will periodically provide some CPU time to the waiting thread, CPU time that could have otherwise be used by other threads that have actual CPU bound operations to perform.

So, as @user1629468 mentioned, asynchronous I/O does not provide better performance but rather better scalability. This is obvious when running in contexts that have a limited number of threads available, like it is the case with web applications. Web application usually use a thread pool from which they assign threads to each request. If requests are blocked on long running I/O operations there is the risk of depleting the web pool and making the web application freeze or slow to respond.

One thing I have noticed is that asynchronous I/O isn't the best option when dealing with very fast I/O operations. In that case the benefit of not keeping a thread busy while waiting for the I/O operation to complete is not very important and the fact that the operation is started on one thread and it is completed on another adds an overhead to the overall execution.

You can read a more detailed research I have recently made on the topic of asynchronous I/O vs. multithreading here.