Why aren't we using the difference of y^ and y to check the effect of w and b?

I am curious: after calculating y^, we use a loss function to check how good the values of w and b are. Why do we use L(y^,y) rather than the simple difference y-y^? The difference y-y^ would tell us the deviation from the actual y.

It’s a good question. One thing to note is that (\hat{y} - y) could be either a positive or negative number, right? So you probably want |\hat{y} - y| if that is your approach. But then notice that this function is not differentiable at the point where \hat{y} - y = 0. We’re going to need to take derivatives in order to use gradient descent to optimize our system.
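To make the sign problem and the kink in the absolute value concrete, here is a minimal sketch in plain Python. The function names and the two sample predictions are made up for illustration and are not from the course notebooks.

```python
def mean_signed_error(y_hat, y):
    """Average of (y_hat - y); positive and negative errors can cancel."""
    return sum(p - t for p, t in zip(y_hat, y)) / len(y)


def mean_absolute_error(y_hat, y):
    """Average of |y_hat - y|; errors no longer cancel."""
    return sum(abs(p - t) for p, t in zip(y_hat, y)) / len(y)


def abs_slope(d):
    """Slope of |d| for d != 0. At d = 0 there is no single slope."""
    if d == 0:
        raise ValueError("|d| has no derivative at d = 0")
    return 1.0 if d > 0 else -1.0


# Hypothetical labels and predictions: both predictions are off by 0.4,
# one too low and one too high.
y = [1.0, 0.0]
y_hat = [0.6, 0.4]

print(mean_signed_error(y_hat, y))    # approximately 0: the errors cancel
print(mean_absolute_error(y_hat, y))  # approximately 0.4
print(abs_slope(-1e-9), abs_slope(1e-9))  # slope jumps from -1 to +1 across 0
```

The signed average reports almost no error even though both predictions are off, and the slope of the absolute value flips sign abruptly at zero, which is the non-differentiable point mentioned above.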

But the more fundamental point here is that the purpose of our network is to do “classification”, meaning a “yes” or “no” answer. So what does it mean to be close in that case? We say that if \hat{y} > 0.5, then the answer is “yes” (meaning the picture contains a cat in our current example). But suppose the picture does contain a cat and the label y = 1. Then by your metric, the answer \hat{y} = 0.49 and the answer \hat{y} = 0.51 only differ by 0.02 in terms of “goodness”. But the first one gives a wrong answer and the second one gives a correct answer, right? And your metric would say that the difference between those two answers is the same as the difference between \hat{y} = 0.89 and \hat{y} = 0.91, but that’s clearly not going to be useful, right? The difference between 0.49 and 0.51 is a lot more significant than the difference between 0.89 and 0.91 for our current purposes.

So for classification problems, what I described above shows that we need a more sophisticated way to evaluate the “goodness” of our answers than just the linear difference. And that’s the purpose of the “log loss” or “cross entropy” loss function that Prof Ng shows us here. Here’s a thread which talks a bit more about how that function works.
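As a rough illustration, here is a sketch of the log loss for a single example, written in plain Python. The function name `log_loss` and the sample values of \hat{y} are my own choices for the example; the formula is the standard binary cross entropy, -(y log(\hat{y}) + (1 - y) log(1 - \hat{y})).

```python
import math


def log_loss(y_hat, y):
    """Binary cross-entropy for one example; y is 0 or 1 and 0 < y_hat < 1."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))


# Label y = 1 (the picture does contain a cat).
for y_hat in (0.01, 0.49, 0.51, 0.89, 0.91, 0.99):
    print(f"y_hat={y_hat:.2f}  |y_hat - y|={abs(y_hat - 1):.2f}  "
          f"log loss={log_loss(y_hat, 1):.3f}")

# The same 0.02 step in y_hat changes the loss by different amounts.
gap_near_half = log_loss(0.49, 1) - log_loss(0.51, 1)
gap_near_one = log_loss(0.89, 1) - log_loss(0.91, 1)
print(gap_near_half, gap_near_one)
```

With the linear difference, both 0.02 steps count the same. With the log loss, the step from 0.49 to 0.51 reduces the loss by more than the step from 0.89 to 0.91, and a confident wrong answer such as \hat{y} = 0.01 is penalized far more heavily than its linear distance from the label would suggest.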

Of course there are other kinds of problems besides classification problems. Suppose our network is trying to predict some kind of continuous number like a stock price or the temperature at 1pm tomorrow. Then a “distance” based loss function will be appropriate, but in that type of case the usual solution is to use the Mean Squared Error, which is basically the square of the normal Euclidean distance. In your example, that would be (\hat{y} - y)^2, although we’d average those values over all the samples. That has nicer mathematical properties than the absolute value of the differences.
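Here is a small sketch of the Mean Squared Error and its derivative with respect to each prediction, in plain Python. The temperature values are invented for the example, and `mse` and `mse_grad` are just illustrative names.

```python
def mse(y_hat, y):
    """Mean of (y_hat - y)^2 over all samples."""
    m = len(y)
    return sum((p - t) ** 2 for p, t in zip(y_hat, y)) / m


def mse_grad(y_hat, y):
    """Derivative of the MSE with respect to each prediction: 2 * (y_hat - y) / m."""
    m = len(y)
    return [2 * (p - t) / m for p, t in zip(y_hat, y)]


# Hypothetical temperatures: true values and predictions.
y = [20.0, 22.0, 19.0]
y_hat = [21.0, 22.0, 17.0]

print(mse(y_hat, y))       # (1 + 0 + 4) / 3
print(mse_grad(y_hat, y))  # zero where the prediction is exact
```

Unlike the absolute value, the squared difference has a derivative everywhere, and that derivative shrinks smoothly to zero as the prediction approaches the true value.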


Thanks for your prompt response and explanation.
Please confirm whether I understand correctly:
Due to sigmoid or any other activation function, the value of y^ can be very small. Hence, if we use plain subtraction, it can cause errors in prediction. Another reason is that if we use plain subtraction, we cannot use the back propagation technique.

The reason cross entropy loss is appropriate as the loss function for classification does not really have to do with the fact that sigmoid output values can be very small, or very close to 1 for that matter. Please read the other thread that I linked above.

We do need derivatives of the loss function in order to do back propagation and optimize the solutions. There are ways to deal with linear differences as the loss function, but in regression problems it is more common to use the square of the difference as the loss metric. The mathematical behavior of the derivatives of the squared distance is more useful. But there are cases in which the so-called L1 loss (the mean absolute error, based on the same L1 norm that “Lasso” regression uses as a penalty) is used. But none of this is really covered or that relevant in DLS Course 1.

I understood it from the link you shared. Very well explained.
Thanks a lot.


Try this next time instead of y^

Dollar sign \hat{y} dollar sign

Except use the actual $ instead of the words

It should render as \hat{y}

Welcome to LaTeX