How to improve performance without going parallel for my backprop ANN
You want to eliminate the conditional from inside your loop here:
```cpp
const double lower_layer_output = i > 0 ? outputs[lower_layer][k] : input[k]; // input layer semantics
```
You can eliminate this conditional by calculating the zeroth iteration (the special case of i == 0) earlier; one way is sketched below.
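For instance, you could pick the source array once, outside the k loop, rather than branching on every iteration. A minimal sketch, reusing the loop variables from your code (the bound num_inputs is illustrative, and the surrounding i/j loops are elided):

```cpp
// Hoist the i > 0 test out of the hot loop: choose the source pointer once.
// Names follow the question's code; num_inputs is an assumed bound.
const double* lower = (i > 0) ? &outputs[lower_layer][0] : &input[0];
for (std::size_t k = 0; k < num_inputs; ++k) {
    const double lower_layer_output = lower[k];
    // ... rest of the inner-loop body, unchanged ...
}
```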
Next, look at your update step:

```cpp
deltas[i][j][k] = delta;
weights[i][j][k] += delta;
```
You mention using std::vector, so this is a vector of vectors of vectors? Your data is not going to be contiguous (except in the sense that each innermost vector is contiguous). Consider using C-style arrays, or a single flat allocation that you index manually.
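A minimal sketch of the flat-allocation idea (the struct and its dimension names are illustrative assumptions, not from your code):

```cpp
#include <vector>
#include <cstddef>

// One contiguous buffer behind weights[i][j][k]-style access.
// Dimension names (layers/neurons/inputs) are illustrative assumptions.
struct Array3D {
    std::vector<double> data;   // single contiguous allocation
    std::size_t neurons, inputs;

    Array3D(std::size_t layers, std::size_t neurons_, std::size_t inputs_)
        : data(layers * neurons_ * inputs_), neurons(neurons_), inputs(inputs_) {}

    double& operator()(std::size_t i, std::size_t j, std::size_t k) {
        return data[(i * neurons + j) * inputs + k];  // row-major layout
    }
};
```

This keeps the innermost k index contiguous in memory, which is exactly what the cache (and, later, SIMD loads) want.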
How big are those dimensions? There may be some caching considerations if they are very large, e.g. you don't want that last subscript [k] to flush the L1 cache. Sometimes breaking the loop up to process a smaller range of k indexes at a time can help (strip mining).
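A strip-mining sketch; the block size is a tuning knob, 64 is just an illustrative starting point, and the function wrapper and num_inputs bound are assumptions:

```cpp
#include <algorithm>
#include <cstddef>

// Process k in fixed-size blocks so each block's working set stays in cache.
void process_in_strips(std::size_t num_inputs)
{
    const std::size_t BLOCK = 64;  // tune against your L1 cache size
    for (std::size_t k0 = 0; k0 < num_inputs; k0 += BLOCK) {
        const std::size_t k_end = std::min(k0 + BLOCK, num_inputs);
        for (std::size_t k = k0; k < k_end; ++k) {
            // ... original inner-loop body for index k ...
        }
    }
}
```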
You can also experiment with unrolling your inner loops a little, e.g. try doing four or eight operations inside the loop, incrementing by 4 or 8 respectively, and handle any remainder in another loop. The compiler may be doing that already; see the sketch below.
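A 4-way unroll sketch for a simple accumulation (the function is illustrative; the separate partial sums also break the dependency chain, and -O3 may already do all of this, so measure before keeping it):

```cpp
#include <cstddef>

// Manual 4-way unroll with a scalar remainder loop.
double sum_unrolled(const double* a, std::size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t k = 0;
    for (; k + 4 <= n; k += 4) {   // main loop: four operations per iteration
        s0 += a[k];
        s1 += a[k + 1];
        s2 += a[k + 2];
        s3 += a[k + 3];
    }
    double s = s0 + s1 + s2 + s3;
    for (; k < n; ++k)             // remainder loop
        s += a[k];
    return s;
}
```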
As others have mentioned, using SIMD (SSE/AVX) is probably where you can find the most gain. You can either use compiler intrinsics (both Visual Studio and gcc support them with the same syntax) or write in assembly (inlined or otherwise). As you mentioned, scaling across multiple cores is another direction; OpenMP can help you do that without a lot of pain.
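A minimal OpenMP sketch (compile with -fopenmp in gcc or /openmp in Visual Studio; the function and its flat weight layout are illustrative assumptions):

```cpp
#include <cstddef>

// Split the per-neuron loop across cores; each j iteration touches a
// disjoint slice of weights, so there are no data races here.
void update_layer(double* weights, const double* deltas,
                  std::ptrdiff_t num_neurons, std::ptrdiff_t num_inputs)
{
    #pragma omp parallel for
    for (std::ptrdiff_t j = 0; j < num_neurons; ++j)
        for (std::ptrdiff_t k = 0; k < num_inputs; ++k)
            weights[j * num_inputs + k] += deltas[j * num_inputs + k];
}
```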
Sometimes it is useful to generate an annotated assembly listing from your code (e.g. with gcc -S -fverbose-asm) to try and see where the compiler isn't doing such a great job.
Agner Fog's optimization manuals (http://www.agner.org/optimize/) are an excellent general resource about optimization.
You can't avoid an O(n^2) algorithm if you want to train/use a NN. But it is perfectly suited for vector arithmetic. For example with clever use of SSE or AVX you could process the neurons in chunks of 4 or 8 and use a multiply-add instead of two separate instructions.
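A sketch of that idea using AVX intrinsics on doubles, four lanes at a time, with a fused multiply-add on CPUs that support FMA (compile with e.g. -mavx -mfma; the function and names are illustrative):

```cpp
#include <immintrin.h>
#include <cstddef>

// Weighted-sum kernel: sum of weights[k] * inputs[k], four doubles per
// iteration, with the multiply and add fused into one instruction.
double weighted_sum(const double* weights, const double* inputs, std::size_t n)
{
    __m256d acc = _mm256_setzero_pd();
    std::size_t k = 0;
    for (; k + 4 <= n; k += 4) {
        __m256d w = _mm256_loadu_pd(weights + k);
        __m256d x = _mm256_loadu_pd(inputs + k);
        acc = _mm256_fmadd_pd(w, x, acc);   // acc += w * x in one instruction
    }
    double lanes[4];
    _mm256_storeu_pd(lanes, acc);           // horizontal sum of the 4 lanes
    double sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; k < n; ++k)                      // scalar remainder
        sum += weights[k] * inputs[k];
    return sum;
}
```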
If you use a modern compiler, carefully reformulate the algorithm, and use the right switches, you might even get the compiler to autovectorize the loops for you, but your mileage may vary.
For gcc, autovectorization is activated by -O3 or -ftree-vectorize. You need a vector-capable CPU of course, so something like -march=core2 -msse4.1 or similar, depending on the target CPU. If you use -ftree-vectorizer-verbose=2 you get detailed explanations of why and where loops were not vectorized. Have a look at http://gcc.gnu.org/projects/tree-ssa/vectorization.html.
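A sketch of a loop shaped the way the autovectorizer likes: a simple counted loop over contiguous, non-aliasing arrays (the function is illustrative; try e.g. g++ -O3 -ftree-vectorize -msse4.1 -ftree-vectorizer-verbose=2):

```cpp
#include <cstddef>

// Simple, countable, branch-free loop over contiguous data: the shape
// gcc's vectorizer handles well. __restrict tells the compiler the
// arrays don't alias, which often makes the difference.
void axpy(double* __restrict y, const double* __restrict x,
          double a, std::size_t n)
{
    for (std::size_t k = 0; k < n; ++k)
        y[k] += a * x[k];
}
```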
Of course, it is better still to use the compiler intrinsics directly.