To continue, we need to remember one thing. If you roll a die and toss a coin, the probability of getting a 1 AND a head is L =\frac{1}{6} \times \frac{1}{2} = \frac{1}{12}. Because the two events are independent, multiplying their individual probabilities gives the joint probability of getting a 1 and a head.
Now, focus on the loss function. A loss function assesses the performance of a model. Let’s say we measure the performance of our model by asking:
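If you would like to check the multiplication rule yourself, here is a minimal sketch that counts outcomes directly. The variable names are my own, and it assumes a fair six-sided die and a fair coin.

```python
from fractions import Fraction
from itertools import product

# All equally likely (die face, coin side) pairs: 6 * 2 = 12 of them.
die_faces = range(1, 7)
coin_sides = ("H", "T")
outcomes = list(product(die_faces, coin_sides))

# Count the pairs that are "a 1 AND a head".
favourable = [pair for pair in outcomes if pair == (1, "H")]
p_by_counting = Fraction(len(favourable), len(outcomes))

# The multiplication rule for independent events.
p_by_multiplying = Fraction(1, 6) * Fraction(1, 2)

print(p_by_counting, p_by_multiplying)
```

Both routes give the same fraction, which is the point of the rule.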
When we feed a dataset in, how likely is it for the model to label the samples the same as observed?
If the likelihood is 100%, then the model’s prediction for each sample matches the corresponding true label perfectly, and its performance is perfect! Given this motivation, we want to train our model such that the likelihood is maximized. Now, let’s see how to calculate this likelihood, and how it leads to the loss function you are asking about.
Let’s say we have this dataset, where f(x_i) is the model’s predicted probability that sample i has label 1 (so the probability it gives to label 0 is 1 - f(x_i)). Let me know if you have questions about the last column of the table.
| | true label / observation | model prediction | probability of predicting the observation |
|---|---|---|---|
| sample 1 | y_1 = 1 | f(x_1) = 0.7 | P_1 = 0.7 |
| sample 2 | y_2 = 0 | f(x_2) = 0.2 | P_2 = 1 - 0.2 = 0.8 |
| sample 3 | y_3 = 1 | f(x_3) = 0.8 | P_3 = 0.8 |
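In case the last column is the confusing part, here is a small sketch of how I get it. The function name `prob_of_observation` is made up for this example, and it assumes f(x) is the model's probability that the label is 1.

```python
def prob_of_observation(y, f_x):
    """Probability the model assigns to the observed label y (0 or 1).

    f_x is assumed to be the model's probability that the label is 1,
    so the probability of label 0 is 1 - f_x.
    """
    return f_x if y == 1 else 1.0 - f_x

labels = [1, 0, 1]             # y_1, y_2, y_3 from the table
predictions = [0.7, 0.2, 0.8]  # f(x_1), f(x_2), f(x_3) from the table

probs = [prob_of_observation(y, f) for y, f in zip(labels, predictions)]
print(probs)  # P_1, P_2, P_3
```

For sample 2 the observed label is 0, so the model "agrees" with probability 1 - 0.2 rather than 0.2.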
With the dataset, we ask the question again:
When we feed the above dataset in, how likely is it for the model to label sample 1 the same as observed (1), AND to label sample 2 the same as observed (0), AND to label sample 3 the same as observed (1)?
The answer works the same way as the question about rolling a 1 AND tossing a head: treating the samples as independent, we multiply the probabilities, and thus the likelihood is
L = P_1 \times P_2 \times P_3 = 0.7 \times 0.8 \times 0.8 = 0.7 \times (1-0.2) \times 0.8 ,
or, we can say that,
L = f(x_1) \times (1- f(x_2)) \times f(x_3)
I hope you are still with me here. And remember, we want to maximize this likelihood. For example, now our L is 0.7 \times 0.8 \times 0.8 because our model is not doing a perfect job. Then when is it doing a perfect job? It’s when L is 1 \times 1 \times 1, agree?
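Here is the same product as a minimal sketch, together with the "perfect job" case. The helper name `likelihood` is my own choice, and it again assumes f(x) is the probability of label 1.

```python
import math

def likelihood(labels, predictions):
    """Product of the probabilities the model gives to each observed label."""
    per_sample = [f if y == 1 else 1.0 - f for y, f in zip(labels, predictions)]
    return math.prod(per_sample)

labels = [1, 0, 1]

# Our model from the table: 0.7 * (1 - 0.2) * 0.8
L_model = likelihood(labels, [0.7, 0.2, 0.8])

# A model that is certain and correct on every sample: 1 * 1 * 1
L_perfect = likelihood(labels, [1.0, 0.0, 1.0])

print(L_model, L_perfect)
```

The first value is roughly 0.448 and the second is exactly 1, so there is room for our model to improve.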
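And it turns out that it’s computationally better to maximize the log version of this likelihood instead: a product of many probabilities quickly becomes a tiny number, while a sum of logs stays manageable and is easier to differentiate. The two are equivalent because \log is a strictly increasing function, so whatever maximizes L also maximizes \log(L), and vice versa. Therefore, we now want to maximize the log-likelihood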
\log(L) = \log(f(x_1)) + \log(1- f(x_2)) + \log(f(x_3))
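To make this concrete, here is a rough sketch with two parts. The first checks that the sum of logs equals the log of the product for our three samples. The second uses a made-up dataset of 2000 samples, each predicted with probability 0.5 (my own assumption, only for illustration), to show one common reason for preferring logs: the plain product runs out of floating-point range while the sum of logs does not.

```python
import math

# Part 1: our three samples, P_1, P_2, P_3 from the table.
probs = [0.7, 0.8, 0.8]
log_likelihood = sum(math.log(p) for p in probs)
log_of_product = math.log(math.prod(probs))
print(log_likelihood, log_of_product)

# Part 2: a hypothetical larger dataset, 2000 samples each with P_i = 0.5.
many_probs = [0.5] * 2000
tiny_product = math.prod(many_probs)            # underflows in double precision
sum_of_logs = sum(math.log(p) for p in many_probs)
print(tiny_product, sum_of_logs)
```

In part 2 the product is too small for a float to represent, so it collapses to zero and can no longer be compared or differentiated usefully, while the sum of logs is just an ordinary negative number.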
Now it starts to look like our loss function, and it will look more like it after we add something:
\log(L) = \\ y_1 \log(f(x_1)) + ( 1- y_1) \log(1- f(x_1)) + \\ y_2 \log(f(x_2)) + ( 1- y_2) \log(1- f(x_2)) + \\y_3 \log(f(x_3)) + ( 1- y_3) \log(1- f(x_3))
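Please examine the above closely and you will find that, even though I added so many terms, the value of \log(L) does not change. Each y_i is either 1 or 0, so in every pair one of the two terms is multiplied by zero and disappears: when y_i = 1 only \log(f(x_i)) survives, and when y_i = 0 only \log(1 - f(x_i)) survives.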
And if we use the summation sign to save some lines, we have this log-likelihood
\log(L) = \sum_{i=1}^{3} \left[ y_i \log(f(x_i)) + (1 - y_i) \log(1 - f(x_i)) \right]
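If you would rather check this numerically than by eye, here is a small sketch comparing the short three-term version with the longer summation form, using the numbers from the table. The variable names are mine.

```python
import math

labels = [1, 0, 1]             # y_i
predictions = [0.7, 0.2, 0.8]  # f(x_i)

# Short form: log(f(x_1)) + log(1 - f(x_2)) + log(f(x_3))
short_form = math.log(0.7) + math.log(1 - 0.2) + math.log(0.8)

# Summation form: every sample contributes both terms,
# but the label y_i switches one of them off.
summation_form = sum(
    y * math.log(f) + (1 - y) * math.log(1 - f)
    for y, f in zip(labels, predictions)
)

print(short_form, summation_form)
```

The two numbers agree, because the label acts like a switch that keeps exactly one term per sample.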
One final step: in gradient descent, we want to minimize the loss, so how can we change from maximizing the log-likelihood to minimizing a loss? In other words, how can we turn our log-likelihood function into a loss function? The answer is to give it a negative sign.
\text{Loss} = -\log(L) = \sum_{i=1}^{3} \left[ -y_i \log(f(x_i)) - (1 - y_i) \log(1 - f(x_i)) \right]
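Putting it all together, here is a minimal sketch of this loss as a function. The name `log_loss` and the small `eps` clipping are my own additions: the clipping is only there so that a prediction of exactly 0 or 1 does not hit \log(0), and it is not part of the derivation above.

```python
import math

def log_loss(labels, predictions, eps=1e-12):
    """Negative log-likelihood summed over all samples.

    labels: observed y_i, each 0 or 1.
    predictions: f(x_i), assumed to be the model's probability of label 1.
    eps: assumed safeguard that keeps predictions away from exactly 0 and 1.
    """
    total = 0.0
    for y, f in zip(labels, predictions):
        f = min(max(f, eps), 1.0 - eps)
        total += -y * math.log(f) - (1 - y) * math.log(1.0 - f)
    return total

labels = [1, 0, 1]
loss_model = log_loss(labels, [0.7, 0.2, 0.8])    # our model from the table
loss_perfect = log_loss(labels, [1.0, 0.0, 1.0])  # certain and correct everywhere

print(loss_model, loss_perfect)
```

For our table the loss is about 0.803, which is -\log(0.448), and for the perfect model it is essentially zero. So a higher likelihood means a lower loss, which is exactly what gradient descent needs.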
I hope that after this example you can see that the argument for constructing this loss function is very simple. We want to train our model such that the likelihood of the model predicting all samples as observed is maximized. Maximizing the likelihood is equivalent to maximizing the log-likelihood, which is in turn equivalent to minimizing the negative log-likelihood, and that is the log loss you are asking about.
Cheers,
Raymond