Does loss have 'meaning'-- Or is it merely relative?

Hi,

So I just completed C5W1A2 and a question occurred to me.

Whereas in previous courses/exercises our loss would typically (hopefully) drop to very low levels, here, even after 22,000 iterations, and though it did decrease, the loss still remained quite high (22.728886).

Granted, in this case we are performing generation and not training to a hard and fast test set.

But that got me thinking: does the measure of loss really have any ‘meaning’? I mean, it is not MSE, or RMSE, or accuracy, etc. And if so, what is it?

Second, if it is only a ‘relative’ measure, is the fact that it tends to decrease all we should care about, and not, in the end, the actual magnitude?

1 Like

For cross-entropy, the primary concern, I guess, is to check whether the loss is decreasing over time or not. If it is, it means our model is learning and we are on the right track. But that by itself does not tell you the quality of the model. I’ve seen several models with a small loss but still poor predictions. Also, the loss doesn’t need to go to 0.
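To see why the loss doesn’t need to reach 0, here is a minimal sketch (plain NumPy; the function name and the numbers are just illustrative, not from the assignment): even a confident, correct prediction contributes a small but nonzero cross-entropy, and when the loss is summed over a long sequence those contributions add up.

```python
import numpy as np

def cross_entropy(y_true_idx, y_hat_probs):
    """Categorical cross-entropy for one prediction:
    -log of the probability assigned to the true class."""
    return -np.log(y_hat_probs[y_true_idx])

# A confident, correct prediction still has nonzero loss
probs = np.array([0.02, 0.95, 0.03])   # predicted distribution over 3 classes
print(cross_entropy(1, probs))          # ~0.051, small but not 0

# Summed over, say, a 25-character sequence, even this "good" model
# reports a loss of roughly 25 * 0.051 ~ 1.3; a less confident model
# easily lands much higher, without being "broken".
```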

1 Like

The minimum loss doesn’t have to be a small number. You’re just trying to find the minimum. The absolute magnitude doesn’t matter.

1 Like

If you are computing loss, then it means you have a hard and fast training set and/or test set and you are computing the loss on that specific dataset. Any kind of loss doesn’t mean anything unless you have a label, right? Otherwise how do you even compute it? What is the y value?

As always, it is really accuracy that you actually care about. Cross-entropy loss measures how good the solution is, but that is not equivalent to accuracy. Think about it even in the regression case with MSE: the Euclidean distance between \hat{y} and y is only a relative measure, right?

Suppose I’m predicting a distance and my prediction is off by 42 kilometers. Is that good or bad? You don’t know until you know what the y value is, right? If I’m predicting an astronomical distance (e.g. the distance from the Sun to Alpha Centauri), that would be an astonishingly tiny error. If I’m predicting the circumference of the Earth, then it’s pretty good, but not amazingly good. But if I’m measuring the distance between a car and a pedestrian for a self-driving application, then that would be a completely useless answer, of course.

So in the regression case loss is also relative, and accuracy in that case would probably be computed as something like:

\displaystyle \frac {||\hat{y} - y||}{||y||}
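To make that concrete, here is a quick sketch (NumPy; the variable names and distances are just illustrative assumptions of mine) showing that the same 42 km absolute error gives wildly different relative errors depending on the scale of y, which is exactly what the formula above captures:

```python
import numpy as np

def relative_error(y_hat, y):
    """Relative error ||y_hat - y|| / ||y||, matching the formula above."""
    return np.linalg.norm(np.atleast_1d(y_hat - y)) / np.linalg.norm(np.atleast_1d(y))

error = 42.0                       # prediction is off by 42 km in every case

y_alpha_centauri = 4.1e13          # km, roughly the Sun-to-Alpha-Centauri distance
y_earth          = 40_075.0        # km, circumference of the Earth
y_pedestrian     = 0.02            # km, a car-to-pedestrian distance of ~20 m

for name, y in [("Alpha Centauri", y_alpha_centauri),
                ("Earth circumference", y_earth),
                ("pedestrian distance", y_pedestrian)]:
    print(f"{name}: relative error = {relative_error(y + error, y):.2e}")

# Alpha Centauri:      ~1e-12  (astonishingly tiny)
# Earth circumference: ~1e-3   (pretty good)
# pedestrian distance: ~2e+3   (completely useless)
```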

In the classification case it’s perhaps a bit more “fuzzy”, if that’s the right word. Note that accuracy for a classification is quantized, which means that a lower cost does not guarantee a better accuracy value, even for a single sample. If I have a sample with y = 1 and \hat{y} = 0.51 or \hat{y} = 0.95, either will give full accuracy, but the latter will have lower cost. And once I’ve run the training for enough iterations to get \hat{y} > 0.5, the cost contribution from that sample may keep getting lower without raising the overall accuracy.
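Here is a small sketch of that quantization effect (plain NumPy; the values 0.51 and 0.95 are just the hypothetical ones from the example above): both predictions count as correct for accuracy purposes, but their cross-entropy contributions differ a lot, so pushing 0.51 toward 0.95 lowers the loss without changing the accuracy at all.

```python
import numpy as np

def binary_cross_entropy(y, y_hat):
    """Per-sample binary cross-entropy loss."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def accuracy(y, y_hat):
    """Accuracy is quantized: the prediction is either right or wrong."""
    return float((y_hat > 0.5) == y)

y = 1
for y_hat in (0.51, 0.95):
    print(f"y_hat={y_hat}: loss={binary_cross_entropy(y, y_hat):.3f}, "
          f"accuracy={accuracy(y, y_hat)}")

# y_hat=0.51: loss=0.673, accuracy=1.0
# y_hat=0.95: loss=0.051, accuracy=1.0
```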

In this particular case (the Dinosaur Names assignment), there are really no “correct” answers. The standard is really whether the generated names are aesthetically pleasing or not. But we still use the existing database of names to define the loss.

So the purpose of the loss function is to drive things in the direction you want them to go. How far you have to drive them requires other measurements besides the pure cost.

Try some experiments like the one I listed above and look at the values.

1 Like

If by “does loss have meaning” you are referring to what its units are, then it’s best to assume it’s a unitless metric.

Because the actual units are going to vary with the data set and method of computing the loss.

For example, using the “sum of the squares of the errors” cost method, if you’re trying to predict housing prices, the units of cost will be “dollars squared”.

That’s not really helpful.

1 Like

@paulinpaloalto I will look at your suggestions and get back; and to you both (@TMosh)-- perhaps ‘meaning’ is too strong a word; perhaps ‘interpretability’?

I mean, I know the ‘Dinos’ case was a bit different, but I am trying to think ahead to my own future models to understand this. Prior to this one, most seemed to converge toward zero-- but I guess as long as my loss is decreasing (even if it is still far from zero), I am okay?

1 Like

Yes.

You may be able to get lower minimums by altering the model, but that’s still a relative comparison.

1 Like

Perhaps towards your point, Paul: I am working on C5W2A2 now, and you are correct, the ‘important’ part, accuracy, is going up-- but wait a minute here…

The cost is going up!?

Even if you look within a single epoch, it seems to increase.

How the h*ll is the gradient (which I thought is supposed to be minimizing our loss…) going up but providing better accuracy?

I feel really confused now…

1 Like

If you’re training on mini-batches, then the cost can vary wildly as the model starts to see new data it hasn’t seen before.

1 Like

Remember how we compute the cost when we are doing minibatch gradient descent: we sum the cost across the minibatches and don’t compute the “real” cost, which is the average over all samples, until we get to the end of the epoch. That is because we can’t compute the average at the minibatch level and then average those: the math doesn’t work if the minibatch size does not evenly divide the total training set size.
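A minimal sketch of that bookkeeping (the function and variable names are my own, assuming a model callable and a per-sample loss function): the per-minibatch values are running sums, and only the division by the total sample count at the end of the epoch gives the “real” average cost.

```python
def epoch_cost(minibatches, loss_fn, model):
    """Accumulate the summed (not averaged) cost over minibatches, then
    divide by the total number of samples at the end of the epoch."""
    total_cost = 0.0
    total_samples = 0
    for X_batch, Y_batch in minibatches:
        Y_hat = model(X_batch)
        # Sum of per-sample losses for this minibatch (no averaging yet)
        total_cost += loss_fn(Y_hat, Y_batch).sum()
        total_samples += len(Y_batch)
    # Averaging within each minibatch and then averaging those averages
    # would be wrong if the last minibatch is smaller than the others.
    return total_cost / total_samples
```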

Go back and look at how this was done when minibatch was introduced in Course 2 Week 2 in the Optimization assignment and then again in the TensorFlow Intro assignment.

That statement as written doesn’t make any sense. What does it mean for the “gradient to go up”? The gradient is a derivative vector, right? So I assume you meant the cost.

I think the real explanation is what I just said above, but remember the point about accuracy being quantized in a classification. The cost can go down while the accuracy gets worse in aggregate. The cost can also go in either direction while the accuracy stays the same in a single iteration, even on a single sample. But what is normally the case (unless you are having convergence problems) is that, “in aggregate”, the average cost over multiple epochs will go down and the accuracy will go up.

1 Like

The other possibilities here are that your model is somehow wrong and the test cases don’t catch it, or that you have modified the test code and/or the “real” training code. My results in both cases look nothing like yours. Here’s the output of the test cell:


Here’s my output for the real training showing some of the middle epochs:

All small numbers for cost and they mostly go down with a bit of bouncing in a couple of places. Note that you must have modified the “metrics” to get it to show the intermediate minibatch cost values.

1 Like

@paulinpaloalto Hmm… Well I got 100% for the assignment and at least for the final runs I think I am looking okay…

And, yes, when I finish the specialization (which I expect, soon), I plan to go through all the lectures again and take notes on the parts I’d forgotten/realize I’d have trouble with now that I have a better ‘lay of the land’.

*And yes, I modded it-- I moved the print statement inside the loop.

1 Like

But keep in mind that you will lose access to the assignments and quizzes once you complete the specialization. So, it will be better to download all the assignments and associated files as you go along. You will have access only to the videos once you complete the specialization…

3 Likes