I guess you could say it is “loss of information” in a very general sense, but you can be much more specific than that. The key point is that you need a metric for “distance” that measures how far your prediction is from the desired “label” value (the correct answer) for a given sample. Of course, how you measure distance depends completely on the nature and meaning of your output values. There are two large categories:

- “Regression” problems where the output is a continuous real number that predicts some value like a stock price, temperature, atmospheric pressure or some other numeric value.
- “Classification” problems where the output looks like the probability that the given input falls into each of the defined categories. E.g. “image contains a cat” or “image does not contain a cat” (binary classification) or a multiclass case (cat, dog, horse, elephant, aardvark, wallaby …).

The mathematical properties of those two types of outputs are fundamentally different, so you need different “loss” functions in those two general cases. For the regression case, the usual solution is to use an actual notion of distance based on the normal “Euclidean distance” (square root of the sum of the squares of the differences in the elements of two vectors). But for convenience, the most common approach is to use the square of the Euclidean distance, i.e. to skip the square root. It is cheaper to compute and has nicer mathematical properties for computing gradients.
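To make the two regression options concrete, here is a minimal sketch in plain Python. The function names and the sample numbers are made up for illustration and do not come from any particular library. It computes the Euclidean distance, its square, and the gradient of the squared version, which comes out as a simple `2 * (prediction - label)` per element.

```python
import math

def euclidean_distance(y_true, y_pred):
    """Square root of the sum of squared element-wise differences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)))

def squared_error(y_true, y_pred):
    """Squared Euclidean distance: the same sum, without the square root."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))

def squared_error_grad(y_true, y_pred):
    """Gradient of squared_error with respect to each prediction."""
    return [2 * (p - t) for t, p in zip(y_true, y_pred)]

labels = [3.0, 5.0, 2.0]       # made-up target values
predictions = [2.0, 5.0, 4.0]  # made-up model outputs

print(euclidean_distance(labels, predictions))
print(squared_error(labels, predictions))
print(squared_error_grad(labels, predictions))
```

In practice the squared error is usually averaged over the samples (mean squared error), but that only rescales the value and does not change the idea.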

For the “classification” case, the Euclidean distance does not work well at all. Consider binary classification. If the label of a given sample is true (1), then there is a big difference between predicting 0.49 and 0.51, because they land on opposite sides of the usual 0.5 threshold for calling the answer “yes”. But a Euclidean distance metric sees the distance to the label shrink by the same 0.02 as it does between 0.96 and 0.98, right? So clearly we need a different metric. It may not be so easy to explain how we get to the familiar “cross entropy” loss function based on logarithms, but it comes from the world of statistics and “maximum likelihood estimation”. Because the outputs look like probabilities, maybe that gives some intuition for why statistics is the place to look. From a practical standpoint, you can see why it works by looking at the basic formula:

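$$L(y, \hat{y}) = -y \log(\hat{y})$$

That is the term that matters when $y = 1$ (the full binary formula adds a matching term, $-(1 - y) \log(1 - \hat{y})$, which covers the $y = 0$ case). So with $y = 1$ it’s really this function:

$$L(y, \hat{y}) = -\log(\hat{y})$$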

If you draw that graph, you can see that the curve works well: as $\hat{y} \rightarrow 1$, $L \rightarrow 0$, and as $\hat{y} \rightarrow 0$, $L \rightarrow \infty$, with the curve getting very steep the closer you get to 0.
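If you would rather see numbers than a curve, here is a small sketch in plain Python that evaluates the loss for a sample whose label is 1 at a few predicted values. The function name and the chosen values are just for illustration. It also prints the slope, which for this function is -1 / ŷ, so you can watch it blow up near 0.

```python
import math

def loss_when_label_is_1(y_hat):
    """Cross entropy loss for one sample whose label is 1: -log(y_hat)."""
    return -math.log(y_hat)

def slope_when_label_is_1(y_hat):
    """Derivative of -log(y_hat) with respect to y_hat."""
    return -1.0 / y_hat

# Predictions ranging from nearly right to badly wrong (illustrative values).
for y_hat in [0.98, 0.96, 0.51, 0.49, 0.1, 0.001]:
    loss = loss_when_label_is_1(y_hat)
    slope = slope_when_label_is_1(y_hat)
    print(f"y_hat = {y_hat:<6} loss = {loss:8.4f}  slope = {slope:10.2f}")
```

The loss is tiny for predictions close to 1 and grows without bound as the prediction approaches 0, so a confident wrong answer is punished much harder than a hesitant one.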

Here’s a graph of just $\log(\hat{y})$ for $0 < \hat{y} < 1$, so you have to flip it vertically about the x axis to get the positive loss value: