In the final assignment part 1, I am unable to figure out why we need dA0 in the `L_model_backward(AL, Y, caches)` function.

In the given test case, it looks like we have 2 layers and the final output AL comes from layer 2. dAL would be dA2, and then we should only need to calculate dA1. But the given code goes on to print dA0. Where did A0 come from, since A0 is essentially the initial input? From what I understand, we don’t calculate the gradient of the input layer.

Yes, we don’t *need* dA0; the fact that we end up calculating it is just an “artifact” of the way backprop works. At each layer, we take dA^{[l]} as input and compute dW^{[l]}, db^{[l]}, and dA^{[l-1]}. Of course we need the dW and db values for layer 1, so we get dA0 as a side effect. On the other thread you linked, there was some speculation about whether there is actually any information you could glean from the gradients of the inputs, but it is just that (speculation).
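To make the “side effect” concrete, here is a minimal sketch of the backward loop (not the course’s exact code — it uses identity activations and a hypothetical `linear_backward` helper just to show the loop shape):

```python
import numpy as np

def linear_backward(dZ, cache):
    # cache holds (A_prev, W, b) saved during the forward pass
    A_prev, W, b = cache
    m = A_prev.shape[1]
    dW = dZ @ A_prev.T / m
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = W.T @ dZ  # gradient w.r.t. this layer's *input*
    return dA_prev, dW, db

# Toy 2-layer net with identity activations (illustrative only)
np.random.seed(0)
m = 4
A0 = np.random.randn(3, m)                      # the network input
W1, b1 = np.random.randn(5, 3), np.zeros((5, 1))
W2, b2 = np.random.randn(1, 5), np.zeros((1, 1))
A1 = W1 @ A0 + b1
AL = W2 @ A1 + b2

caches = [(A0, W1, b1), (A1, W2, b2)]
dAL = np.ones_like(AL)                          # stand-in for dL/dAL
grads = {}
dA = dAL
for l in reversed(range(2)):                    # layer 2, then layer 1
    dA, grads[f"dW{l + 1}"], grads[f"db{l + 1}"] = linear_backward(dA, caches[l])

# After processing layer 1, dA is dA0: same shape as the input A0.
# It was computed on the way to dW1/db1, even though we never use it.
print(dA.shape)  # → (3, 4)
```

Notice that layer 1’s call returns dW1 and db1 (which we need) *and* dA0 (which we don’t) from the same matrix products, so dA0 comes essentially for free.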