Why do we use gradient descent in the backpropagation algorithm?

The backpropagation algorithm IS gradient descent, and the reason it is usually restricted to the first derivative (instead of Newton's method, which requires the Hessian) is that applying the chain rule to the first derivative is what gives us the "back propagation" in the backpropagation algorithm. Newton's method is problematic (complex and expensive to compute), but that does not stop us from using quasi-Newton methods, especially BFGS (I believe many neural network software packages already offer BFGS as part of their training these days).
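For illustration, here is a minimal sketch (not the internals of any particular package) of training a tiny one-hidden-layer network with SciPy's BFGS routine instead of hand-rolled gradient descent; the network size, data, and loss are invented for the example:

```python
# A minimal sketch: fitting a tiny one-hidden-layer network with BFGS.
# BFGS builds an approximation to the inverse Hessian from successive
# gradients, so no explicit second derivatives are ever formed.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                  # 50 toy samples, 3 features
y = (X.sum(axis=1) > 0).astype(float)         # toy binary targets

def unpack(theta):
    W1 = theta[:3 * 5].reshape(3, 5)          # input -> 5 hidden units
    w2 = theta[3 * 5:]                        # hidden -> 1 output
    return W1, w2

def loss(theta):
    W1, w2 = unpack(theta)
    h = np.tanh(X @ W1)                       # forward pass
    p = 1.0 / (1.0 + np.exp(-(h @ w2)))       # sigmoid output
    return -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

theta0 = rng.normal(scale=0.1, size=3 * 5 + 5)
result = minimize(loss, theta0, method="BFGS")
print(result.fun, result.nit)
```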

As for the fixed learning rate, it need not be fixed at all. There are papers as far back as '95 reporting on this (search for "adaptive learning rate backpropagation").
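As a concrete illustration of that idea, here is a sketch of one very simple adaptive-learning-rate rule of the kind reported in that literature, the "bold driver" heuristic; the grow/shrink factors are arbitrary choices for the example:

```python
# Sketch of a "bold driver" adaptive learning rate: grow the rate after a
# successful step, shrink it and reject the step after a bad one.
import numpy as np

def bold_driver_step(w, grad_fn, loss_fn, lr, grow=1.05, shrink=0.5):
    old_loss = loss_fn(w)
    w_new = w - lr * grad_fn(w)
    if loss_fn(w_new) < old_loss:
        return w_new, lr * grow          # accept the step, be a bit bolder
    return w, lr * shrink                # reject the step, be more cautious

# Toy usage on a quadratic bowl f(w) = ||w||^2
loss_fn = lambda w: float(np.sum(w ** 2))
grad_fn = lambda w: 2.0 * w
w, lr = np.array([3.0, -4.0]), 0.1
for _ in range(50):
    w, lr = bold_driver_step(w, grad_fn, loss_fn, lr)
print(w, lr)
```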


I think there is a mistake in the question in speaking of "the use of gradient descent in the backpropagation algorithm": they are two different things. Let me make their definitions clear:

Backpropagation is a computationally efficient method for evaluating the derivatives of the error with respect to the weights in a network; the term backpropagation is specifically used to describe this evaluation of derivatives. These derivatives are then used to make adjustments to the weights, and the simplest such technique is gradient descent.
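To make the distinction concrete, here is a minimal NumPy sketch in which `backprop` only evaluates the derivatives via the chain rule, and the weight update afterwards is plain gradient descent; the network shape, data, and learning rate are invented for illustration:

```python
# Stage 1 (`backprop`) evaluates derivatives; stage 2 (the loop body)
# adjusts the weights with simple gradient descent.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.normal(size=(100, 1))
W1 = rng.normal(size=(4, 8)) * 0.1
W2 = rng.normal(size=(8, 1)) * 0.1

def backprop(W1, W2, X, y):
    # Forward pass, then propagate errors backwards to get derivatives.
    h = np.tanh(X @ W1)
    y_hat = h @ W2
    err = (y_hat - y) / len(X)                # dLoss/dy_hat for (half) mean squared error
    dW2 = h.T @ err                           # chain rule, output layer
    dh = (err @ W2.T) * (1 - h ** 2)          # chain rule through tanh
    dW1 = X.T @ dh                            # chain rule, input layer
    return dW1, dW2

lr = 0.1
for _ in range(200):
    dW1, dW2 = backprop(W1, W2, X, y)         # stage 1: evaluate derivatives
    W1 -= lr * dW1                            # stage 2: gradient descent update
    W2 -= lr * dW2
```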

It is important to recognize that the two stages are distinct. The first stage, namely the propagation of errors backwards through the network (i.e. backpropagation) in order to evaluate derivatives, can be applied to many other kinds of network, not just the multilayer perceptron. Similarly, the second stage of weight adjustment using the calculated derivatives can be tackled with a variety of optimization schemes, many of which are substantially more powerful than simple gradient descent, such as methods with adaptive learning rates, e.g. Nesterov momentum, Adam, etc. I hope this answers the first part of the question.
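Continuing the sketch above, the very same derivatives returned by `backprop` can be fed into a more powerful update rule. As an illustration, here is an Adam-style update written out in NumPy (standard textbook hyperparameters; a sketch, not a reference implementation):

```python
# Adam-style update rule: the gradients come from the same `backprop` stage,
# only the weight-adjustment stage changes.
import numpy as np

def adam_update(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g                 # first-moment (momentum-like) estimate
    v = b2 * v + (1 - b2) * g ** 2            # second-moment estimate
    m_hat = m / (1 - b1 ** t)                 # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Usage: replace the plain update `W1 -= lr * dW1` in the previous sketch with
#   W1, m1, v1 = adam_update(W1, dW1, m1, v1, t)   # m1, v1 start as zeros, t counts steps
```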

Now there is definitely a good reason to use gradient descent, or another algorithm with an adaptive learning rate, over second-order methods like Newton's method: the application of Newton's method to training large neural networks is limited by the significant computational burden it imposes. The number of elements in the Hessian matrix is the square of the number of parameters, so with k parameters (and even for very small neural networks the number of parameters k can be in the millions), Newton's method would require the inversion of a k × k matrix, with computational complexity of O(k^3). Also, since the parameters change with every update, the inverse Hessian has to be recomputed at every training iteration. As a consequence, only networks with a very small number of parameters can be practically trained via Newton's method.
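A quick back-of-the-envelope calculation (with an assumed parameter count of one million, as mentioned above) makes the cost concrete:

```python
# Rough cost of one Newton step for an assumed k = 1,000,000 parameters.
k = 1_000_000
hessian_entries = k ** 2                    # k x k matrix: 10^12 entries
hessian_bytes = hessian_entries * 8         # float64 storage: roughly 8 TB
inversion_flops = float(k) ** 3             # O(k^3) inversion: about 10^18 operations
print(f"~{hessian_bytes / 1e12:.0f} TB to store H, ~{inversion_flops:.0e} flops to invert it")
```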