Understanding Neural Network Backpropagation
What I read from Step 3's equation is:
- O_h = the output of this hidden unit from the last forward pass (for a unit on the input layer, O_h is simply the actual input value)
- w_kh = weight of the connection between this hidden unit and a unit k of the next layer (towards the output)
- delta_k = error of that unit k of the next layer (towards the output, the same unit as in the previous bullet)
Each unit has only one output, but each link between that output and the units of the next layer is weighted. So the output value is the same, but on the receiving end each unit will see a different value if the weights of the links differ. O_h always refers to the value this neuron produced on the last forward pass. Error does not apply to the input layer, since by definition the input has no 'error' per se.
The error needs to be calculated layer by layer, starting at the output side, since we need the error values of layer N+1 to calculate layer N. You are right, there is no direct connection between input and output in backpropagation.
I believe the equation is correct, if counterintuitive. What is probably confusing is that in forward propagation, for each unit, we have to consider all the units and links to its left (the input side), but for error propagation (backpropagation) we have to consider the units and links to its right (the output side), as the sketch below illustrates.
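To make the direction of the two passes concrete, here is a minimal sketch in Python/NumPy (my own toy example with made-up shapes and names, not code from the tutorial; it assumes sigmoid units and a squared-error output delta) of computing the deltas layer by layer, starting at the output side:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical 2-3-1 network: weights[l][k][h] connects unit h of layer l
# to unit k of layer l+1 (rows = receiving units, columns = sending units).
weights = [np.random.randn(3, 2), np.random.randn(1, 3)]

# Forward pass: remember the output O of every layer, because the
# backward pass needs the "last output" of each unit.
x = np.array([0.5, -0.2])
outputs = [x]
for W in weights:
    outputs.append(sigmoid(W @ outputs[-1]))

# Backward pass: start with the output-layer delta, then walk towards
# the input, using the deltas of layer N+1 to compute those of layer N.
target = np.array([1.0])
delta = (outputs[-1] - target) * outputs[-1] * (1 - outputs[-1])
for l in range(len(weights) - 1, 0, -1):
    O_h = outputs[l]
    # delta_h = O_h * (1 - O_h) * sum_k(w_kh * delta_k)
    delta = O_h * (1 - O_h) * (weights[l].T @ delta)
# No delta is computed for the input layer (outputs[0]).
```

Note how the backward loop only ever looks at the weights and deltas to the right of the layer being processed, and never produces a delta for the input layer.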
I'm not sure what your question is, but I actually went through that tutorial myself and I can assure you that, other than one obvious typo, there is nothing incorrect about it.
I will assume that your question arises because you are confused about how the backpropagation hidden delta is derived. If this is indeed your question, then please consider
[the tutorial's hidden-delta equation] (source: pandamatak.com)
You are probably confused as to how the author derived this equation. It is actually a straightforward application of the multivariate chain rule. Namely (what follows is taken from Wikipedia):
"Suppose that each argument of z = f(u, v) is a two-variable function such that u = h(x, y) and v = g(x, y), and that these functions are all differentiable. Then the chain rule would look like:
"
Now imagine extending the chain rule, by an induction argument, to E(z'_1, z'_2, ..., z'_n), where z'_k is the pre-activation of the kth output-layer unit, and each z'_k is in turn a function of w_ji. That is to say, E is a function of the z', and z' itself is a function of w_ji (if this doesn't make sense to you at first, think very carefully about how a NN is set up). Applying the chain rule directly, extended to n variables:

∂E(z'_1, z'_2, ..., z'_n)/∂w_ji = Σ_k ∂E/∂z'_k · ∂z'_k/∂w_ji

That is the most important step. The author then applies the chain rule again, this time within the sum, to expand the ∂z'_k/∂w_ji term:

∂z'_k/∂w_ji = ∂z'_k/∂o_j · ∂o_j/∂z_j · ∂z_j/∂w_ji.
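Putting the two applications of the chain rule together (under my reading of the usual setup, where z_j = Σ_i w_ji o_i, o_j = f(z_j) and z'_k = Σ_j w_kj o_j, so that ∂z'_k/∂o_j = w_kj, ∂o_j/∂z_j = f'(z_j) and ∂z_j/∂w_ji = o_i, and writing δ_k for ∂E/∂z'_k):

∂E/∂w_ji = Σ_k δ_k · w_kj · f'(z_j) · o_i = o_i · f'(z_j) · Σ_k w_kj δ_k

and the factor f'(z_j) · Σ_k w_kj δ_k is, as far as I can tell, exactly the hidden-unit delta computed in Step 3 of the tutorial.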
If you have difficulties understanding the chain rule, you may need to take a course on multivariate calculus, or read the corresponding section in a textbook.
Good luck.
The tutorial you posted here is actually doing it wrong. I double-checked it against Bishop's two standard books and against two of my working implementations. I will point out exactly where below.
An important thing to keep in mind is that you are always searching for derivatives of the error function with respect to a unit or a weight. The former are the deltas; the latter are what you use to update your weights.
If you want to understand backpropagation, you have to understand the chain rule. It's all about the chain rule here. If you don't know exactly how it works, look it up on Wikipedia - it's not that hard. But as soon as you understand the derivations, everything falls into place. Promise! :)
∂E/∂W can be decomposed into ∂E/∂o ∂o/∂W via the chain rule. ∂o/∂W is easily calculated, since it's just the derivative of the activation/output of a unit with respect to the weights. ∂E/∂o is actually what we call the deltas. (I am assuming that E, o and W are vectors/matrices here.)
We do have them for the output units, since that is where we can calculate the error. (Usually we have an error function whose delta comes down to (t_k - o_k), e.g. the quadratic error function in the case of linear outputs and cross entropy in the case of logistic outputs.)
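For instance, here is a minimal sketch (my own example with assumed shapes and names, using a linear output layer and the quadratic error) of how the output delta and the corresponding weight gradient fit together:

```python
import numpy as np

# Linear output layer o = W @ h with quadratic error E = 0.5 * sum((t - o)**2).
h = np.array([0.2, 0.7, 0.1])   # outputs of the hidden layer
W = np.random.randn(2, 3)       # hidden -> output weights
t = np.array([1.0, 0.0])        # targets

o = W @ h                       # linear outputs

# dE/do for the quadratic error; up to sign convention this is the
# (t_k - o_k) delta mentioned above.
dE_do = o - t

# Chain rule: dE/dW = dE/do * do/dW, and do_k/dw_kj = h_j,
# so the gradient is an outer product.
dE_dW = np.outer(dE_do, h)

W -= 0.1 * dE_dW                # one gradient-descent step
```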
The question now is, how do we get the derivatives for the internal units? Well, we know that the output of a unit is the weighted sum of all incoming units, with a transfer function applied afterwards. So o_k = f(sum(w_kj * o_j, for all j)).
So what we do is differentiate o_k with respect to o_j: delta_j = ∂E/∂o_j = Σ_k ∂E/∂o_k ∂o_k/∂o_j = Σ_k delta_k ∂o_k/∂o_j, summing over all units k that unit j feeds into. So given the delta_k, we can calculate delta_j!
Let's do this: o_k = f(sum(w_kj * o_j, for all j)) => ∂o_k/∂o_j = f'(sum(w_kj * o_j, for all j)) * w_kj = f'(z_k) * w_kj, writing z_k for the weighted sum.
For the case of the sigmoidal transfer function, this becomes z_k(1 - z_k) * w_kj. (Here is the error in the tutorial: the author says o_k(1 - o_k) * w_kj!)
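Put together, the delta propagation described in this answer could look roughly like the following sketch (my own toy example with made-up shapes and names, assuming a sigmoid transfer function and quadratic error; f'(z_k) is computed directly from the sigmoid):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One hidden layer j feeding an output layer k through weights w_kj.
o_j  = np.array([0.3, 0.8, 0.1])    # hidden outputs
w_kj = np.random.randn(2, 3)        # hidden -> output weights
t_k  = np.array([1.0, 0.0])         # targets

z_k = w_kj @ o_j                    # output pre-activations (weighted sums)
o_k = sigmoid(z_k)                  # output activations

delta_k   = o_k - t_k                            # dE/do_k for quadratic error
fprime_zk = sigmoid(z_k) * (1 - sigmoid(z_k))    # f'(z_k) for the sigmoid

# delta_j = sum_k(delta_k * f'(z_k) * w_kj), i.e. dE/do_j
delta_j = w_kj.T @ (delta_k * fprime_zk)
```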