It’s a good question. One thing to note is that (\hat{y} - y) could be either a positive or a negative number, right? So you probably want |\hat{y} - y| if that is your approach. But then notice that this function is not differentiable at the point where \hat{y} = y, meaning where the difference is 0. We’re going to need to take derivatives in order to use gradient descent to optimize our system.
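To make the differentiability point concrete, here is a minimal numerical sketch (not from the course materials). The helper name `abs_error` and the step size `eps` are just illustrative choices. It estimates the slope of |\hat{y} - y| on either side of the point \hat{y} = y with finite differences.

```python
import numpy as np

def abs_error(y_hat, y):
    """Absolute difference |y_hat - y| (hypothetical helper for illustration)."""
    return np.abs(y_hat - y)

y = 1.0      # the label
eps = 1e-6   # small step used for the finite-difference estimates

# Slope when approaching y_hat = y from below and from above.
slope_left = (abs_error(y, y) - abs_error(y - eps, y)) / eps
slope_right = (abs_error(y + eps, y) - abs_error(y, y)) / eps

print("slope from the left: ", slope_left)
print("slope from the right:", slope_right)
```

The two one-sided slopes come out close to -1 and +1, so there is no single well-defined derivative exactly at \hat{y} = y.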

But the more fundamental point is that the purpose of our network here is to do “classification”, meaning a “yes” or “no” answer. So what does it mean to be close in that case? We say that if \hat{y} > 0.5, then the answer is “yes” (meaning the picture contains a cat in our current example). But suppose the picture does contain a cat and the label y = 1, then by your metric, the answer \hat{y} = 0.49 and the answer \hat{y} = 0.51 only differ by 0.02 in terms of “goodness”. But the first one gives a wrong answer and the second one gives a correct answer, right? And your metric would say that the difference between those two answers is the same as the difference between \hat{y} = 0.89 and \hat{y} = 0.91, but that’s clearly not going to be useful, right? The difference between 0.49 and 0.51 is a lot more significant than the difference between 0.89 and 0.91 for our current purposes.

So for classification problems, what I described above shows that we need a more sophisticated way to evaluate the “goodness” of our answers than just the linear difference. And that’s the purpose of the “log loss” or “cross entropy” loss function that Prof Ng shows us here. Here’s a thread which talks a bit more about how that function works.
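As a rough illustration of how the two metrics behave, here is a small sketch assuming the standard per-sample binary cross entropy formula -(y log(\hat{y}) + (1 - y) log(1 - \hat{y})). The function names and the sample predictions are my own choices for the example, not anything from the lecture.

```python
import numpy as np

def binary_cross_entropy(y_hat, y):
    """Per-sample log loss, assuming 0 < y_hat < 1."""
    y_hat = np.asarray(y_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def absolute_error(y_hat, y):
    """The simple distance-based metric |y_hat - y|."""
    return np.abs(np.asarray(y_hat, dtype=float) - np.asarray(y, dtype=float))

y = 1.0  # the picture does contain a cat
for y_hat in [0.01, 0.49, 0.51, 0.89, 0.91, 0.99]:
    print(f"y_hat={y_hat:.2f}  "
          f"abs_error={absolute_error(y_hat, y):.4f}  "
          f"log_loss={binary_cross_entropy(y_hat, y):.4f}")
```

The absolute error never exceeds 1, while the log loss keeps growing as \hat{y} moves toward the wrong answer (it is about 4.6 at \hat{y} = 0.01 when y = 1), so confident wrong predictions get punished much harder.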

Of course there are other kinds of problems besides classification problems. Suppose our network is trying to predict some kind of continuous number like a stock price or the temperature at 1pm tomorrow. Then a “distance” based loss function will be appropriate, but in that type of case the usual solution is to use the Mean Squared Error, which is basically the square of the normal Euclidean distance. In your example, that would be (\hat{y} - y)^2, although we’d average those values over all the samples. That has nicer mathematical properties than the absolute value of the differences: in particular, it is differentiable everywhere, including at \hat{y} = y.
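For the regression case, a minimal sketch of Mean Squared Error might look like the following. The temperature values are made up purely for illustration, and the gradient function just applies the usual derivative of the averaged squared difference with respect to each prediction.

```python
import numpy as np

def mean_squared_error(y_hat, y):
    """Average of the squared differences over all samples."""
    y_hat = np.asarray(y_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.mean((y_hat - y) ** 2)

def mse_gradient(y_hat, y):
    """Derivative of the MSE with respect to each prediction: 2 * (y_hat - y) / m."""
    y_hat = np.asarray(y_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    return 2.0 * (y_hat - y) / y.size

# Made-up example: true and predicted temperatures for three days.
y = np.array([20.0, 22.5, 19.0])
y_hat = np.array([21.0, 22.0, 17.0])

print("MSE:     ", mean_squared_error(y_hat, y))
print("gradient:", mse_gradient(y_hat, y))
```

Notice that the gradient shrinks smoothly to 0 as a prediction approaches its label, which is one reason the squared version plays nicely with gradient descent.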