I have a question regarding this lecture, just asking out of curiosity; please feel free to ignore it if it does not make sense.
In this video, at the 5:25 timestamp, Professor Andrew Ng explains that you could calculate dA0, which is the gradient propagated from the first hidden layer back to the input layer X.
My questions are:
- Can this dA0 value be considered some sort of "neural network" error?
- Are there any techniques to leverage this error in our network and somehow forward-propagate this "back propagation" error?
The point Prof Ng is making here is that it is just an “artifact” of the way back propagation works that you end up generating a gradient for A0 (X) at the first hidden layer. Of course the point of running back propagation at that layer is that you do need the gradients for W1 and b1. And just because of the way the general algorithm works, you end up generating dA0 as a side effect. But there is literally no use for that value: what would it mean to “improve” the inputs? The inputs are the inputs: that’s kind of the point, right? So we simply ignore dA0. Of course dA1 and dA2 (etc) are used in the calculations for the relevant layers.
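For concreteness, here is a minimal sketch of the linear-backward step at the first hidden layer, in the course's notation (dZ1, W1, b1, and A0 = X). The variable names and shapes here are illustrative, not the actual assignment code; the point is just that the same generic formula that produces dW1 and db1 also produces dA0, which is then thrown away.

```python
import numpy as np

def linear_backward_layer1(dZ1, A0, W1):
    """Generic linear-backward rule applied at the first hidden layer.
    In the course's notation, A0 is just the input X."""
    m = A0.shape[1]                               # number of examples
    dW1 = dZ1 @ A0.T / m                          # gradient we actually need
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m  # gradient we actually need
    dA0 = W1.T @ dZ1                              # produced by the same formula...
    return dW1, db1, dA0                          # ...but dA0 is never used

# Illustrative shapes: 3 input features, 4 hidden units, 5 examples
rng = np.random.default_rng(0)
dZ1 = rng.standard_normal((4, 5))
A0 = rng.standard_normal((3, 5))
W1 = rng.standard_normal((4, 3))
dW1, db1, _ = linear_backward_layer1(dZ1, A0, W1)  # dA0 simply discarded
```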
That makes sense, thanks for explaining, @paulinpaloalto.
I suppose that in a well-functioning network dA0 is, or should be, relatively small (since the network fits the input data quite well)? At least that's what I'd think intuitively.
That’s an interesting intuition! You could instrument your code to see whether that happens or not. E.g. compute the 2-norm of dA0 as a measure of how “big” the gradients are in aggregate and see if it decreases as the training converges to a better and better solution. It would be interesting to know if you learn anything from that type of investigation!
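To make that concrete, here's a toy sketch of that kind of instrumentation (not the course assignment code): a tiny one-hidden-layer network trained on synthetic data with gradient descent, logging the 2-norm of dA0 alongside the cost every few hundred iterations. The data, architecture, and hyperparameters are just placeholders for illustration.

```python
import numpy as np

# Toy experiment: does ||dA0|| shrink as the cost decreases?
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 200))                # A0: 3 features, 200 examples
Y = (X[0:1, :] + X[1:2, :] > 0).astype(float)    # arbitrary binary labels

W1, b1 = rng.standard_normal((4, 3)) * 0.1, np.zeros((4, 1))
W2, b2 = rng.standard_normal((1, 4)) * 0.1, np.zeros((1, 1))
lr, m = 0.5, X.shape[1]

for i in range(2001):
    # Forward pass: tanh hidden layer, sigmoid output
    Z1 = W1 @ X + b1; A1 = np.tanh(Z1)
    Z2 = W2 @ A1 + b2; A2 = 1 / (1 + np.exp(-Z2))

    # Backward pass
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m; db2 = dZ2.sum(axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)
    dW1 = dZ1 @ X.T / m;  db1 = dZ1.sum(axis=1, keepdims=True) / m
    dA0 = W1.T @ dZ1                             # the "unused" gradient w.r.t. X

    if i % 500 == 0:
        cost = -np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))
        print(f"iter {i}: cost {cost:.4f}, ||dA0|| {np.linalg.norm(dA0):.4f}")

    # Gradient descent update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```

Whether the norm actually trends downward will depend on the data and the architecture, so treat it as an experiment rather than a guaranteed result.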
But the overall point is that we have no direct use for dA0 and just end up ignoring it.