Can anyone explain why dW and db denote gradients of the cost function J, whereas dA denotes the gradient of the loss function L with respect to A? Why isn't dJ/dA calculated?
dAL here is just dL/dAL. If it were dJ/dAL, then where is the 1/m?
Your neural network parameters are W and b. A is produced in the intermediate layers when W, b, and x are known. The gradient with respect to A only helps when we apply the chain rule; otherwise, the training process updates only W and b, which in turn update A.
Yes, that's true, but I am asking why the gradient dAL is defined as dL/dAL (the derivative of the loss function with respect to AL), while dW is dJ/dW (the derivative of the cost function with respect to W).
\frac{\partial L}{\partial W^{[l-1]}} = \frac{\partial L}{\partial A^{[l-1]}}\frac{\partial A^{[l-1]}}{\partial W^{[l-1]}}
\frac{\partial A^{[l-1]}}{\partial W^{[l-1]}} is easy to compute, so that is the factor we calculate during backprop in multi-layer neural networks.
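For concreteness, assuming the usual linear step Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]} with activation A^{[l]} = g(Z^{[l]}) from the course notation, the layer-level chain-rule factors work out to:

\frac{\partial L}{\partial Z^{[l]}} = \frac{\partial L}{\partial A^{[l]}} * g'(Z^{[l]})

\frac{\partial L}{\partial A^{[l-1]}} = W^{[l]T} \frac{\partial L}{\partial Z^{[l]}}

Each of these factors is still per-example; the averaging over the m examples only happens when the parameter gradients are finally assembled.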
Thanks for taking the time… I think you may not have fully understood my question. Nonetheless, I made some deductions, and anyone can correct me on this:
It’s a good question that has come up before. Here’s an earlier thread that discusses the same points.
The point is that most of the formulas Prof Ng shows are for “layer” level Chain Rule factors and the \frac {1}{m} only comes in when you finally put all the Chain Rule factors together to compute the actual gradients of the weight or bias values. You could have structured things differently, but you need to make sure you don’t end up with multiple factors of \frac {1}{m}.
Of course computing that last factor \displaystyle \frac {\partial J}{\partial L} is easy: the gradient of the average is the average of the gradients. Think about it for a second and that should make sense.
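A minimal numerical sketch of "the gradient of the average is the average of the gradients", using a made-up one-parameter squared-error loss (w, x, y here are illustrative, not from the course):

```python
import numpy as np

m = 4
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])
w = 0.5

# Per-example loss L_i = (w*x_i - y_i)^2, so dL_i/dw = 2*(w*x_i - y_i)*x_i
per_example_grads = 2 * (w * x - y) * x

# Gradient of the cost J = (1/m) * sum(L_i): just the average of the per-example gradients
dJ_dw = per_example_grads.mean()

# Same result by differentiating J directly; the 1/m appears exactly once
dJ_dw_direct = (1 / m) * np.sum(2 * (w * x - y) * x)
assert np.isclose(dJ_dw, dJ_dw_direct)
```

So the per-example Chain Rule factors (like dA) carry no 1/m; it enters once, at the final averaging step.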
Thanks for the explanation. Can you correct me if I am wrong?
So dW^{[l]} = \frac{\partial J}{\partial W^{[l]}} = \frac{1}{m} dZ^{[l]} A^{[l-1]T} involves a sum over all training examples, and dZ^{[l]} = dA^{[l]} * g'(Z^{[l]}) involves the term dA. So we basically compute dA = dL/dA for each training example, and it gets substituted into dJ/dW, which does the summation over all training examples (so we don't end up with multiple factors of \frac{1}{m})…
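A sketch of one vectorized backprop step for a ReLU layer, with made-up shapes, showing that dA and dZ stay per-example (one column per example) and the 1/m enters only in dW and db:

```python
import numpy as np

np.random.seed(0)
m = 5                  # number of training examples
n_prev, n_l = 3, 2     # units in layer l-1 and layer l

A_prev = np.random.randn(n_prev, m)   # activations from layer l-1
W = np.random.randn(n_l, n_prev)
b = np.zeros((n_l, 1))

Z = W @ A_prev + b                    # forward linear step
dA = np.random.randn(n_l, m)          # dL/dA arriving from the layer above

# dZ = dA * g'(Z): ReLU derivative, still one column per example, no 1/m
dZ = dA * (Z > 0)

# The 1/m enters exactly once, in the parameter gradients dJ/dW and dJ/db
dW = (1 / m) * (dZ @ A_prev.T)
db = (1 / m) * np.sum(dZ, axis=1, keepdims=True)

# Passed back to layer l-1, again per-example
dA_prev = W.T @ dZ
```

Note dW has the shape of W and db the shape of b, while dA_prev keeps one column per example for the next step of backprop.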
I had one more doubt: when building a neural network, do we first have to derive the gradients of the model's parameters (vectorized) manually beforehand, based on the cost function and activations we choose? In this course we mostly used the cross-entropy loss and derived the gradients accordingly. If we had another loss function, all the gradient formulas would change, right?
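Mostly only the seed of backprop changes: dAL = dL/dAL depends on the loss, but everything downstream (dZ, dW, db, dA_prev) reuses the same chain-rule machinery. A small sketch with made-up predictions and labels:

```python
import numpy as np

AL = np.array([[0.8, 0.2, 0.6]])   # predictions, shape (1, m)
Y  = np.array([[1.0, 0.0, 1.0]])   # labels

# Cross-entropy loss (as in the course): dL/dAL = -(Y/AL) + (1-Y)/(1-AL)
dAL_xent = -(Y / AL) + (1 - Y) / (1 - AL)

# A different loss, e.g. squared error L = (AL - Y)^2: dL/dAL = 2*(AL - Y)
dAL_mse = 2 * (AL - Y)

# Either dAL then feeds the same backward recursion dZ = dAL * g'(Z), etc.
```

So in practice you only re-derive the final-layer dAL (and any new activation derivatives); the layer-by-layer backprop formulas stay the same.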