hello everyone
can anyone tell me what is the importance of calculating the loss since we will not use it in the backpropagation step??
Hey @abdou_brk,
Welcome to the community. I am really confused as to why you are asking this question in Week 4. Prof Andrew has explicitly explained the importance of the loss function for back-propagation multiple times in the previous weeks.
If the loss function is not used in back-propagation, then let me ask you this: in your opinion, how are the gradients that back-propagation uses to update the parameters calculated? Don't we take the derivative of the loss function with respect to the parameters? If you are confused about this, I strongly urge you to watch the lecture videos again. The loss function is what starts back-propagation; without it, we can't even take the first backprop step, and Prof Andrew explains this pretty well in the lectures. I hope this helps.
Cheers,
Elemento
Elemento has done a great job of explaining how back propagation works: it is all driven by the derivatives of the Cost/Loss function w.r.t. the parameters. So the Loss function matters deeply (pun intended) to how everything works here: that is how we can move from a randomly initialized set of weights to a better and better solution during training.

But it is also interesting to note that the actual scalar J value (the average of the loss across the samples) isn't actually used directly in back propagation. It's just an inexpensive proxy for whether convergence is working or not. You'll see that we frequently plot the J value periodically during training to make sure it is decreasing. When it "plateaus", then you know that either that's the best you can do or that you need to consider adjusting some of your hyperparameters, like the learning rate or number of iterations, to get better convergence.
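To make the monitoring pattern concrete, here is a minimal sketch (a toy single-feature logistic regression with made-up data and hyperparameters, not the course's assignment code): the parameter updates use only the gradients, and J is evaluated just periodically as a progress check.

```python
import numpy as np

np.random.seed(1)
X = np.random.randn(1, 100)          # toy inputs (assumed shapes)
Y = (X > 0).astype(float)            # toy labels
w, b, lr = 0.0, 0.0, 0.5
costs = []

for i in range(1000):
    A = 1 / (1 + np.exp(-(w * X + b)))   # forward pass (sigmoid)
    dw = np.mean((A - Y) * X)            # gradients are what drive the update
    db = np.mean(A - Y)
    w -= lr * dw
    b -= lr * db
    if i % 100 == 0:                     # J computed only as a progress proxy
        J = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))
        costs.append(J)

# Plotting `costs` (e.g. with matplotlib) shows whether J is still
# decreasing or has plateaued.
```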
Hey @paulinpaloalto Sir,
Thanks a lot for mentioning this here. I kinda missed this perspective when curating my answer
Cheers,
Elemento
yes that's the point, because in the backprop equations:
dZ^{[L]} = A^{[L]} - Y
dW^{[L]} = \frac {1}{m} dZ^{[L]} A^{[L-1]T} … and so on,
i don't see where we'll use the value of the cost J,
since we start by computing dZ^{[L]} = A^{[L]} - Y, and so we only need our predicted values and the outputs Y to start backprop
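As a concrete illustration of those output-layer equations, here is a small NumPy sketch (the layer sizes, weights, and labels are made up for the example): the cost J never appears in the computation of dZ, dW, or db.

```python
import numpy as np

np.random.seed(0)
m = 4                                 # number of examples (assumed)
A_prev = np.random.randn(3, m)        # A^[L-1], activations from the previous layer
W = np.random.randn(1, 3)
b = np.zeros((1, 1))
Y = np.array([[1.0, 0.0, 1.0, 0.0]])  # labels

Z = W @ A_prev + b
A = 1 / (1 + np.exp(-Z))              # A^[L], sigmoid output

dZ = A - Y                            # dZ^[L] = A^[L] - Y
dW = (1 / m) * dZ @ A_prev.T          # dW^[L] = (1/m) dZ^[L] A^[L-1].T
db = (1 / m) * np.sum(dZ, axis=1, keepdims=True)
```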
Well, the derivatives of L are just Chain Rule factors that contribute to the overall gradients, right? Prof Ng’s notation is a little ambiguous, but remember that:
dW^{[l]} = \displaystyle \frac {\partial J}{\partial W^{[l]}}
That’s why it ends up including the sum and the factor of \frac {1}{m}. Because J is the average of L and the derivative of the average is the average of the derivatives. Think about it for a second and that should make sense, because taking averages is a linear operation.
Notice that only dW^{[l]} and db^{[l]} have the factor of \frac {1}{m}. That’s because they are the only terms in any of this that are derivatives of J, as opposed to L or some smaller part of the whole Chain Rule computation.
yes, i understand that we calculate the derivatives of J, but what i want to say is this: let's assume, for the sake of simplicity, that our cost function is J(x, y) = x^2 + y. Then we'll compute \partial J/\partial x and \partial J/\partial y, which are our derivatives with respect to the parameters x and y (w and b in our real cost function). Therefore we calculate
\partial J/\partial x = 2x and \partial J/\partial y = 1, so we don't actually need to compute the value of J(x, y) = x^2 + y itself. So i think the value of the cost function isn't used while training a neural network; we can simply calculate it to check how our model performs during the iterations of gradient descent. Is that true?
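That toy example can be run directly (the starting point and learning rate here are arbitrary choices for illustration): gradient descent on J(x, y) = x^2 + y only ever touches the derivatives 2x and 1, and J's value is computed afterwards purely for inspection.

```python
# Gradient descent on the toy cost J(x, y) = x^2 + y.
x, y, lr = 3.0, 2.0, 0.1
for _ in range(50):
    dx, dy = 2 * x, 1.0      # dJ/dx = 2x, dJ/dy = 1
    x -= lr * dx             # the updates use only the derivatives,
    y -= lr * dy             # never the value J(x, y) itself

J = x**2 + y                 # computed afterwards, just to inspect progress
```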
Yes, I agree. That’s what I was trying to say in one of my earlier replies on this thread, but maybe your way of saying it is clearer: