Why aren't we using the difference of \hat{y} and y to check the effect of w and b?

It’s a good question. One thing to note is that (\hat{y} - y) could be either a positive or negative number, right? So you probably want |\hat{y} - y| if that is your approach. But then notice that the function |z|, where z = \hat{y} - y, is not differentiable at z = 0. We’re going to need to take derivatives in order to use gradient descent to optimize our system.
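To see the differentiability issue concretely, the derivative of |z| is

\frac{d}{dz}|z| = \begin{cases} +1 & z > 0 \\ -1 & z < 0 \end{cases}

and it is undefined at z = 0, where the left and right slopes disagree. Gradient descent needs a well-defined gradient everywhere it might land.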

But the more fundamental point here is that the purpose of our network here is to do “classification”, meaning giving a “yes” or “no” answer. So what does it mean to be close in that case? We say that if \hat{y} > 0.5, then the answer is “yes” (meaning the picture contains a cat in our current example). But suppose the picture does contain a cat and the label is y = 1. Then by your metric, the answer \hat{y} = 0.49 and the answer \hat{y} = 0.51 only differ by 0.02 in terms of “goodness”. But the first one gives a wrong answer and the second one gives a correct answer, right? And your metric would say that the difference between those two answers is the same as the difference between \hat{y} = 0.89 and \hat{y} = 0.91, but that’s clearly not going to be useful: the difference between 0.49 and 0.51 is a lot more significant than the difference between 0.89 and 0.91 for our current purposes.

So for classification problems, what I described above shows that we need a more sophisticated way to evaluate the “goodness” of our answers than just the linear difference. And that’s the purpose of the “log loss” or “cross entropy” loss function that Prof Ng shows us here. Here’s a thread which talks a bit more about how that function works.
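To make that concrete, here is a minimal sketch in plain NumPy of the binary cross entropy L(\hat{y}, y) = -\big(y \log \hat{y} + (1 - y)\log(1 - \hat{y})\big), evaluated at a few made-up prediction values for a positive example (y = 1), with the absolute difference shown alongside for comparison:

```python
import numpy as np

def binary_cross_entropy(y_hat, y):
    """Log loss for one prediction: -(y*log(y_hat) + (1-y)*log(1-y_hat))."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = 1  # the picture really does contain a cat
for y_hat in [0.01, 0.49, 0.51, 0.91, 0.99]:
    print(f"y_hat = {y_hat:.2f}  |y_hat - y| = {abs(y_hat - y):.2f}  "
          f"log loss = {binary_cross_entropy(y_hat, y):.3f}")
```

The loss grows very large for confidently wrong predictions (roughly 4.6 at \hat{y} = 0.01 here) while staying tiny for confidently right ones, and it is smooth everywhere on (0, 1), which is exactly what we need for gradient descent.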

Of course there are other kinds of problems besides classification problems. Suppose our network is trying to predict some kind of continuous number like a stock price or the temperature at 1pm tomorrow. Then a “distance” based loss function is appropriate, and in that type of case the usual solution is the Mean Squared Error, which is basically the squared Euclidean distance averaged over the samples. In your example, the per-sample loss would be (\hat{y} - y)^2, and we’d average those values over all the samples. That has nicer mathematical properties than the absolute value of the differences.
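Here is a minimal sketch of that in plain NumPy (the prediction and label arrays are made-up illustrative values):

```python
import numpy as np

def mean_squared_error(y_hat, y):
    """Average of (y_hat - y)^2 over all samples."""
    return np.mean((y_hat - y) ** 2)

# Hypothetical temperature predictions vs. actual values
y_hat = np.array([21.5, 19.0, 25.2])
y     = np.array([22.0, 18.5, 24.0])
print(mean_squared_error(y_hat, y))  # ((0.5)^2 + (0.5)^2 + (1.2)^2) / 3
```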
