Week 2: Loss Function Portrays..?


I thought I could ask a question here that I was asked in an interview for a Machine Learning Engineer post.

When asked, I briefly explained to them how a neural network works, along with gradient descent and the loss function. About the loss function, I said that it helps in the optimization of our model: we compute it to measure how well our model is performing, and we want to minimize it so that our predictions are closer to the true values.

To which I was asked “What is the loss about? What are we losing?”

I have never thought about what the loss actually portrays, but off the top of my head I think it is loss of information. In ML, we basically try to derive the parameters of a function from our dataset so that it can mimic the true behavior on unseen data as well. If our loss is high, it means our model is failing at that, which is basically a loss of information.

This is just what I think. I am not an expert at NN and I would really like to hear everybody’s answer to this. Also, correct me if I messed up any concept.

Thanks in advance. :slight_smile:

I guess you could say it is “loss of information” in a very general sense, but the real point is that you can be much more specific than that. The key point is that you need a metric for “distance” that measures how far your prediction is from the desired “label” value (correct answer) for a given prediction. Of course how you measure distance completely depends on the nature and meaning of your output values. There are two large categories:

  1. “Regression” problems where the output is a continuous real number that predicts some value like a stock price, temperature, atmospheric pressure or some other numeric value.
  2. “Classification” problems where the output looks like the probability that the given input falls into any of the defined categories. E.g. “image contains a cat” or “image does not contain a cat” (binary classification) or a multiclass case (cat, dog, horse, elephant, aardvark, wallaby …).

The mathematical properties of those two types of outputs are fundamentally different, so you need different “loss” functions in those two general cases. For the regression case, the usual solution is to use an actual notion of distance based on the normal “Euclidean distance” (square root of the sum of the squares of the differences in the elements of two vectors). But for convenience, the most common approach is to use the square of the Euclidean Distance. It is cheaper to compute and has nicer mathematical properties for computing gradients.
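To make the regression case concrete, here is a minimal Python sketch of both quantities. The function names and the sample vectors are made up for illustration, and real frameworks usually also average the squared error over the samples (mean squared error).

```python
import math

def euclidean_distance(y_true, y_pred):
    # Square root of the sum of squared element-wise differences.
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)))

def squared_error(y_true, y_pred):
    # Same sum without the square root: the squared Euclidean distance.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))

def squared_error_gradient(y_true, y_pred):
    # Derivative of the squared error with respect to each prediction.
    # It is just 2 * (prediction - label), with no square root to differentiate.
    return [2 * (p - t) for t, p in zip(y_true, y_pred)]

# Hypothetical labels and predictions, chosen so the arithmetic is easy to check.
y_true = [3.0, 5.0]
y_pred = [0.0, 1.0]

print(euclidean_distance(y_true, y_pred))      # 5.0
print(squared_error(y_true, y_pred))           # 25.0
print(squared_error_gradient(y_true, y_pred))  # [-6.0, -8.0]
```

The gradient function is where the "nicer mathematical properties" show up: each component is linear in the error.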

For the “classification” case, the Euclidean distance does not work well at all. Consider the case of a binary classification. If the label of a given sample is true (1), then there is a big difference between predicting 0.49 and 0.51, but a Euclidean distance metric will consider that error the same as the difference between 0.96 and 0.98, right? So clearly we need a different metric. It may not be so easy to explain how we get to the familiar “cross entropy” loss function based on logarithms, but it comes from the world of statistics and “maximum likelihood estimation”. Because the outputs look like probabilities, maybe that gives some intuition for why statistics is the place to look. From a practical standpoint, you can show why it works by looking at the basic formula:

L(y, \hat{y}) = - y * log(\hat{y})

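That is the term that matters when y = 1 (the full binary cross entropy also has a - (1 - y) * log(1 - \hat{y}) term, which vanishes in that case), so it’s really this function: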

L(y, \hat{y}) = - log(\hat{y})

If you draw that graph, you can see that the curve works well: as \hat{y} \rightarrow 1, L \rightarrow 0 and as \hat{y} \rightarrow 0, L \rightarrow \infty with the curve getting very steep the closer you get to 0.
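If you would rather see numbers than a curve, here is a small Python sketch that evaluates both the squared error and -log(\hat{y}) for a sample whose label is 1. The function names and the particular prediction values are my own choices for illustration.

```python
import math

def squared_error(y, y_hat):
    # Squared distance between the label and the predicted probability.
    return (y - y_hat) ** 2

def cross_entropy_positive(y_hat):
    # Loss -log(y_hat) for a sample whose true label is 1.
    # Only defined for y_hat > 0; the loss grows without bound as y_hat -> 0.
    return -math.log(y_hat)

# Predictions for a positive sample, from confident-and-right
# down to confident-and-wrong.
for y_hat in (0.98, 0.96, 0.51, 0.49, 0.01):
    print(
        f"y_hat = {y_hat:.2f}  "
        f"squared error = {squared_error(1, y_hat):.4f}  "
        f"cross entropy = {cross_entropy_positive(y_hat):.4f}"
    )
```

The squared error can never exceed 1 here, while the log-based loss keeps growing as the prediction for a true sample approaches 0, so confident wrong answers are punished much harder.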

Here’s a graph of just log(\hat{y}) for 0 < \hat{y} < 1, so you have to flip it vertically about the x axis to get the positive loss value: