Week 4: Backpropagation


Can anyone explain why dW and db denote gradients of the cost function J, whereas dA is the gradient of the loss function L w.r.t. A? Why is dJ/dA not calculated?

[image: slide with the backprop gradient formulas]
dAL here is just dL/dAL. If it were dJ/dAL, then where is the 1/m?

Your neural network's parameters are W and b. A is produced in the intermediate layers by the network once W, b, and x are known. The gradient with respect to A is only needed when we apply the chain rule; otherwise, the training process updates only W and b, which in turn update A.

1 Like

Yes, that's true, but I am asking why the gradient dAL is defined as dL/dAL (the derivative of the loss function with respect to AL), while dW is dJ/dW (the derivative of the cost function w.r.t. W).

\frac{\partial L}{\partial W^{[l-1]}} = \frac{\partial L}{\partial A^{[l-1]}}\frac{\partial A^{[l-1]}}{\partial W^{[l-1]}}

\frac{\partial A^{[l-1]}}{\partial W^{[l-1]}} is easy to compute, which is why we calculate it for backprop in multi-layer neural networks.
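To make the chain-rule factorization concrete, here is a minimal sketch (not from the course materials; the single sigmoid unit, squared-error loss, and all numbers are illustrative assumptions) that computes dL/dw as a product of the three factors and checks it against a finite-difference estimate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny single-unit "layer": a = sigmoid(w*x + b), with loss L = (a - y)^2
x, y = 0.7, 1.0
w, b = 0.3, -0.1

z = w * x + b
a = sigmoid(z)

# Chain rule: dL/dw = (dL/da) * (da/dz) * (dz/dw)
dL_da = 2.0 * (a - y)
da_dz = a * (1.0 - a)   # sigmoid'(z) in terms of a
dz_dw = x
dL_dw = dL_da * da_dz * dz_dw

# Finite-difference check of the same derivative
eps = 1e-6
L = lambda w_: (sigmoid(w_ * x + b) - y) ** 2
dL_dw_fd = (L(w + eps) - L(w - eps)) / (2 * eps)

print(dL_dw, dL_dw_fd)  # the two estimates agree closely
```

The middle factor da/dz is the "easy to compute" piece the formula above refers to: it only depends on the layer's own activation.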

1 Like

Thanks for taking the time… I think you may have misunderstood my question… nonetheless, I made some deductions, and anyone can correct me on this:


As dW = dJ/dW = (1/m) dZ A.T, the dot product here indeed performs the summation over training examples, and that is why dA is computed from the loss of each example and then substituted into dZ, which appears in dJ/dW. (I am bad at conveying this.)

It’s a good question that has come up before. Here’s an earlier thread that discusses the same points.

The point is that most of the formulas Prof Ng shows are for “layer” level Chain Rule factors and the \frac {1}{m} only comes in when you finally put all the Chain Rule factors together to compute the actual gradients of the weight or bias values. You could have structured things differently, but you need to make sure you don’t end up with multiple factors of \frac {1}{m}.

Of course computing that last factor \displaystyle \frac {\partial J}{\partial L} is easy: the gradient of the average is the average of the gradients. Think about it for a second and that should make sense.
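The "gradient of the average is the average of the gradients" point can be verified numerically. This is a small sketch (a made-up squared-error per-example loss; the data are arbitrary assumptions) showing that dJ/da_i is just dL_i/da_i scaled by 1/m:

```python
import numpy as np

np.random.seed(1)
m = 5
a = np.random.rand(m)                      # predictions for m examples
y = np.random.randint(0, 2, m).astype(float)

# Per-example loss L_i = (a_i - y_i)^2, cost J = (1/m) * sum(L_i)
dL_da = 2.0 * (a - y)                      # per-example gradients dL_i/da_i
dJ_da = dL_da / m                          # the single factor of 1/m enters here

# Finite-difference check on one component of dJ/da
eps = 1e-6
J = lambda a_: np.mean((a_ - y) ** 2)
a_plus, a_minus = a.copy(), a.copy()
a_plus[0] += eps
a_minus[0] -= eps
fd = (J(a_plus) - J(a_minus)) / (2 * eps)

print(dJ_da[0], fd)  # the analytic and numeric values agree closely
```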

Thanks for the explanation. Can you correct me if I am wrong?
So dW = dJ/dW = (1/m) dZ A.T involves a summation over all training examples, and in it dZ^{[l]} = dA^{[l]} * g'(Z^{[l]}) involves the term dA. We basically compute dA = dL/dA for each training example, which is then substituted into dJ/dW, where the summation over all training examples happens (so we don't end up with multiple 1/m factors)…

I had one more doubt… while building a neural network, do we first have to derive the gradients of the model's (vectorized) parameters by hand, based on the cost function and activations we choose? In this course we mostly used cross-entropy loss and derived the corresponding gradients. If we used another loss function, the whole set of gradient formulas would change, right?

  1. Initially, the model parameters are not optimized to predict the outcome accurately, so we measure a loss to quantify the difference between the actual and predicted values of y. To (for lack of better words) "correct the incorrect parameters W and b" and make the predictions closer to the actual y, we update the model parameters using the gradients.
  2. Yes, the gradient depends on the explicit form of the loss. E.g., if we used hinge loss instead of cross-entropy loss, the mathematical formula of the gradient would be different.
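To illustrate the second point, here is a small sketch (the prediction values and scores are made-up; the squared-error comparison is my addition, not from the thread) showing how the formula for the first backprop term changes with the loss:

```python
import numpy as np

AL = np.array([0.9, 0.2, 0.6])   # sigmoid outputs in (0, 1)
Y  = np.array([1.0, 0.0, 1.0])   # binary labels

# Cross-entropy loss (used in the course): dL/dAL = -(Y/AL - (1-Y)/(1-AL))
dAL_ce = -(Y / AL - (1 - Y) / (1 - AL))

# Squared-error loss L = (AL - Y)^2 gives a different starting gradient
dAL_mse = 2.0 * (AL - Y)

# Hinge loss (labels in {-1,+1}, applied to a raw score s):
# L = max(0, 1 - y*s), so dL/ds = -y where the margin is violated, else 0
s    = np.array([1.5, -0.3, 0.2])
y_pm = np.array([1.0, -1.0, 1.0])
dL_ds_hinge = np.where(1 - y_pm * s > 0, -y_pm, 0.0)

print(dAL_ce)
print(dAL_mse)
print(dL_ds_hinge)
```

Everything downstream of this first term (dZ, dW, db) reuses the same chain-rule machinery; only this loss-specific factor changes.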
1 Like