How much faster is NCHW compared to NHWC in TensorFlow/cuDNN?
The reason is that most implementations of simple convolutions (not talking Winograd or FFT here) end up doing some kind of simple matrix multiplication, which means that in their inner loop they multiply values from the two tensors and accumulate the sum.
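To make that inner loop concrete, here is a minimal NumPy sketch of a direct convolution (illustrative only; real implementations tile and vectorize this heavily, and the function name and shapes are my own):

```python
import numpy as np

def direct_conv_nchw(x, w):
    """Naive direct convolution: NCHW input, OIHW weights.

    x: (N, C, H, W), w: (O, C, KH, KW). No padding, stride 1.
    Illustrative only -- shows the multiply-accumulate inner loop.
    """
    N, C, H, W = x.shape
    O, _, KH, KW = w.shape
    out = np.zeros((N, O, H - KH + 1, W - KW + 1), dtype=x.dtype)
    for n in range(N):
        for o in range(O):
            for i in range(out.shape[2]):
                for j in range(out.shape[3]):
                    acc = 0.0
                    # the inner multiply-accumulate the answer refers to
                    for c in range(C):
                        for ki in range(KH):
                            for kj in range(KW):
                                acc += x[n, c, i + ki, j + kj] * w[o, c, ki, kj]
                    out[n, o, i, j] = acc
    return out
```

Everything below is about how that innermost multiply-accumulate over C maps onto memory on different hardware.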
On a CPU implementation, using SSE or AVX vectorization, it's faster to do this along the C dimension, because you multiply-add the values 4 by 4 or 8 by 8 and then do the reduction (sum your 4 or 8 accumulators) once at the end, after you've covered the whole C dimension.
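As a quick illustration (a NumPy sketch with made-up shapes), NHWC puts the C values of a single pixel next to each other in memory, which is exactly the stride-1 pattern those SSE/AVX multiply-adds want:

```python
import numpy as np

H, W, C = 64, 64, 256
x_nhwc = np.random.rand(H, W, C).astype(np.float32)  # C is the fastest-moving axis
k = np.random.rand(C).astype(np.float32)             # one 1x1 filter, for simplicity

print(x_nhwc.strides)  # (65536, 1024, 4): the stride over C is 4 bytes, i.e. contiguous

# For one pixel (i, j), this dot product reads C consecutive floats and does
# a single horizontal reduction at the end -- the pattern described above.
i, j = 10, 20
acc = np.dot(x_nhwc[i, j, :], k)
```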
On a GPU however, doing a reduction across threads is a more costly operation (at least it was until Kepler introduced warp-level shuffle operations), so historically GPU kernels have been optimized so that each thread in a warp reads consecutive (in memory) HW values and does the accumulation over parts of C with a loop.
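Here is a small sketch of that access pattern (plain index arithmetic, not real kernel code; the offset helpers are my own): with NCHW, 32 threads of a warp reading 32 consecutive HW positions touch 32 adjacent addresses, while the equivalent NHWC reads sit C elements apart:

```python
def nchw_offset(n, c, h, w, C, H, W):
    return ((n * C + c) * H + h) * W + w

def nhwc_offset(n, h, w, c, H, W, C):
    return ((n * H + h) * W + w) * C + c

C_, H_, W_ = 64, 32, 32
# A warp of 32 threads, each handling one consecutive pixel at a fixed channel:
nchw = [nchw_offset(0, 0, 0, t, C_, H_, W_) for t in range(32)]
nhwc = [nhwc_offset(0, 0, t, 0, H_, W_, C_) for t in range(32)]

print(nchw[:4])  # [0, 1, 2, 3]      -- adjacent, fully coalesced
print(nhwc[:4])  # [0, 64, 128, 192] -- strided by C, poorly coalesced here
```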
Note though that the latest NVIDIA cards (the RTX series) now have Tensor Cores, which can process small matrix blocks in one operation, including the reduction over a small portion of C, so on these cards it's actually faster to use NHWC (or hybrid NCHWc formats such as cuDNN's NCHW_VECT_C).
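As a rough sketch of what such a hybrid layout looks like (modelled on cuDNN's NCHW_VECT_C; the block size of 4 here is illustrative, the real format vectorizes 4 or 32 int8 channels), you can derive it from NCHW with a reshape and transpose:

```python
import numpy as np

N, C, H, W = 1, 8, 2, 2
vec = 4  # channels per block; illustrative -- NCHW_VECT_C uses 4 or 32

x = np.arange(N * C * H * W, dtype=np.float32).reshape(N, C, H, W)

# N,C,H,W -> N,C//vec,vec,H,W -> N,C//vec,H,W,vec, then make it contiguous,
# so a small block of channels becomes the innermost (adjacent) axis --
# exactly the unit a tensor core can multiply and reduce in one operation.
x_vect_c = np.ascontiguousarray(
    x.reshape(N, C // vec, vec, H, W).transpose(0, 1, 3, 4, 2))
print(x_vect_c.shape)  # (1, 2, 2, 2, 4)
```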
As of TF 1.1, you can't even call NHWC directly; TF does the conversion to and from NCHW under the hood. So, regardless of how efficient cuDNN's NHWC implementation might be, from the TF user's perspective NCHW is faster:
https://github.com/tensorflow/tensorflow/issues/8286
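For reference, this is how you select the layout through the TF 1.x Python API (a minimal sketch; the shapes are made up, and note that in TF 1.x the NCHW path for conv2d is GPU-only):

```python
import tensorflow as tf  # TF 1.x style API

x_nhwc = tf.placeholder(tf.float32, [None, 224, 224, 3])
x_nchw = tf.placeholder(tf.float32, [None, 3, 224, 224])
# Filters are always [KH, KW, in_channels, out_channels], whatever data_format is.
filt = tf.Variable(tf.random_normal([3, 3, 3, 64]))

# For non-unit strides, the strides vector must follow the data_format ordering.
y_nhwc = tf.nn.conv2d(x_nhwc, filt, strides=[1, 1, 1, 1], padding='SAME',
                      data_format='NHWC')
y_nchw = tf.nn.conv2d(x_nchw, filt, strides=[1, 1, 1, 1], padding='SAME',
                      data_format='NCHW')
```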
The performance ratio will of course depend on the problem, but my sense is that it's big: you don't want to use NHWC on GPU if you can avoid it (it seems likely that you'd be wasting memory too).
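If you want an actual number for your own problem, the sanest thing is to measure it; a rough TF 1.x benchmarking sketch (hypothetical shapes and iteration count) would look like this:

```python
import time
import numpy as np
import tensorflow as tf  # TF 1.x session-style API

def bench(data_format, iters=50):
    # Hypothetical shapes -- substitute your own problem size.
    n, h, w, c = 32, 64, 64, 64
    shape = (n, c, h, w) if data_format == 'NCHW' else (n, h, w, c)
    tf.reset_default_graph()
    x = tf.constant(np.random.rand(*shape).astype(np.float32))
    filt = tf.constant(np.random.rand(3, 3, c, c).astype(np.float32))
    y = tf.nn.conv2d(x, filt, strides=[1, 1, 1, 1], padding='SAME',
                     data_format=data_format)
    # Reduce to a scalar so device-to-host transfer doesn't dominate the timing.
    y = tf.reduce_sum(y)
    with tf.Session() as sess:
        sess.run(y)  # warm-up (allocation, autotuning)
        start = time.time()
        for _ in range(iters):
            sess.run(y)
        return (time.time() - start) / iters

print('NCHW: %.5fs/iter  NHWC: %.5fs/iter' % (bench('NCHW'), bench('NHWC')))
```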