Derivative of activation function and use in backpropagation

Training a neural network just means finding values for every entry in the weight matrices (of which there are two for a network with one hidden layer) such that the squared differences between the observed and predicted outputs are minimized. In practice, the individual weights comprising the two weight matrices are adjusted after each training pattern (their initial values are usually set to small random numbers). This is called online (or incremental) learning, as opposed to batch learning, in which the updates are accumulated over many patterns and the weights are adjusted only after a full pass through the data.
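To make the pieces concrete, here is a minimal sketch of such a network in Python/NumPy, assuming one sigmoid hidden layer and a linear output; all names and sizes are illustrative, not anything prescribed by the update rule itself:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative one-hidden-layer network: the two weight matrices W1 and W2
# are the quantities that training must determine.
rng = np.random.default_rng(0)
n_inputs, n_hidden, n_outputs = 3, 4, 1
W1 = rng.normal(scale=0.1, size=(n_hidden, n_inputs))   # input  -> hidden
W2 = rng.normal(scale=0.1, size=(n_outputs, n_hidden))  # hidden -> output

def predict(x):
    h = sigmoid(W1 @ x)   # hidden-layer activations
    return W2 @ h         # linear output layer

def squared_error(x, target):
    # The quantity training tries to minimize for each pattern
    return 0.5 * np.sum((predict(x) - target) ** 2)
```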

But how should the weights be adjusted--i.e., which direction +/-? And by how much?

That's where the derivative comes in. A large value of the derivative results in a large adjustment to the corresponding weight. This makes sense because a large derivative means you are still far from a minimum. Put another way, at each iteration the weights are adjusted in the direction of steepest descent (opposite the gradient) on the error surface defined by the total error between observed and predicted outputs.

After the error on each pattern is computed (by subtracting the actual value of the response variable, i.e. the target output vector, from the value predicted by the network on that iteration), each weight in the weight matrices is adjusted in proportion to the calculated error gradient.

Because the error calculation begins at the back of the network (i.e., at the output layer, by subtracting observed from predicted) and is then propagated backwards toward the input layer, the procedure is called backpropagation (backprop).
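Here is a minimal sketch of one such online update for the toy network above. The expressions are the standard gradient-descent/backprop formulas under the assumptions already made (linear output, sigmoid hidden layer, squared error); the learning rate lr is an illustrative parameter:

```python
def backprop_update(x, target, lr=0.1):
    """One online backprop step: the error is computed at the output layer
    and propagated backwards to obtain the gradient for each weight matrix."""
    global W1, W2
    h = sigmoid(W1 @ x)            # forward pass
    y = W2 @ h
    err = y - target               # start at the output: predicted minus observed
    # Gradients via the chain rule (linear output, sigmoid hidden layer)
    dW2 = np.outer(err, h)
    delta_hidden = (W2.T @ err) * h * (1.0 - h)   # sigmoid'(z) = h * (1 - h)
    dW1 = np.outer(delta_hidden, x)
    # Adjust each weight in proportion to its gradient, stepping downhill
    W2 -= lr * dW2
    W1 -= lr * dW1
```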


More generally, the derivative (or gradient, for multivariable problems) is used by an optimization technique (for backprop, plain gradient descent is the usual choice, though conjugate gradient and other methods are also used) to locate minima of the objective (a.k.a. loss) function.
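As a toy illustration of that hand-off, here is a sketch using SciPy's conjugate-gradient method on a simple quadratic standing in for a network's error function; loss and grad are illustrative stand-ins for the objective and its backprop-computed gradient:

```python
import numpy as np
from scipy.optimize import minimize

# Stand-in objective: a quadratic bowl with its minimum at (1, -2)
def loss(w):
    return (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2

def grad(w):
    return np.array([2.0 * (w[0] - 1.0), 2.0 * (w[1] + 2.0)])

result = minimize(loss, x0=np.zeros(2), jac=grad, method="CG")
print(result.x)   # approximately [1, -2]
```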

It works this way:

A minimum of a curve is a point where the tangent line has a slope of 0--that is, where the first derivative equals 0.

So if you are walking around a 3D surface defined by the objective function and you reach a point where the slope is 0, then you are at the bottom--you have found a minimum (whether global or local) of the function.

But the first derivative does more than that. It also tells you whether you are moving in the right direction to reach the function's minimum.

It's easy to see why this is so if you think about what happens to the slope of the tangent line as the point on the curve/surface moves down toward the function's minimum.

The slope (and hence the magnitude of the derivative at that point) gradually decreases toward zero. In other words, to minimize a function, follow the derivative downhill--i.e., step against its sign, and if its magnitude is shrinking then you are moving in the correct direction.
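A tiny numerical illustration of that last point, using f(x) = x^2 so that f'(x) = 2x: as gradient descent steps toward the minimum at x = 0, the derivative shrinks toward zero.

```python
def f_prime(x):
    return 2.0 * x     # derivative of f(x) = x**2

x, lr = 5.0, 0.1
for step in range(5):
    d = f_prime(x)
    print(f"x = {x:6.3f}, derivative = {d:6.3f}")
    x -= lr * d        # step downhill, i.e. against the sign of the derivative
```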


The weight update formula you cite isn't just some arbitrary expression. It comes about by assuming an error function (the squared error) and minimizing it with gradient descent. The derivative of the activation function appears in it because of the chain rule of calculus.
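Concretely, for a single weight $w$ feeding a unit with weighted input $z$, activation $a = g(z)$, target $t$, and squared error $E = \tfrac{1}{2}(a - t)^2$, the chain rule gives

$$
\frac{\partial E}{\partial w}
= \frac{\partial E}{\partial a}\,\frac{\partial a}{\partial z}\,\frac{\partial z}{\partial w}
= (a - t)\, g'(z)\, x,
$$

where $x$ is the input carried by that weight. The $g'(z)$ factor is exactly where the derivative of the activation function enters the update rule; for the logistic sigmoid, $g'(z) = g(z)\bigl(1 - g(z)\bigr)$.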

Textbooks on neural networks are more likely to include the full derivation of the backpropagation update rule--for example, Introduction to the Theory of Neural Computation by Hertz, Krogh, and Palmer.