Can anybody explain the cost function of the logistic regression algorithm to me? The loss function, to be more precise. I couldn’t get the idea of the loss function.
Hi Hamza, The cost function measures the difference between the estimated outputs (calculated by the neural network) and the actual outputs. Suppose that in a data set, for certain values x1, x2, x3 of set X, the output is Y. Your neural network calculates a value Y-hat from x1, x2, x3. If Y-hat equals Y, then your cost is zero and your network performance is great. If Y-hat is not equal to Y, you run Gradient Descent on your cost function to find the parameter values that give the least cost. Least cost means that your neural network output Y-hat is as close to the actual value of Y as possible.
To continue, we need to remember one thing. If you roll a die and toss a coin, the probability of getting a 1 AND a head is L =\frac{1}{6} \times \frac{1}{2} = \frac{1}{12}. Here, we multiply the probabilities of these two independent events to get the joint probability of rolling a 1 and tossing a head.
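As a quick check of the multiplication rule, here is a minimal Python sketch (the variable names are my own, for illustration only). It counts the equally likely outcomes of one die roll plus one coin toss and compares that count with the product of the two separate probabilities.

```python
from fractions import Fraction
from itertools import product

# All equally likely (die face, coin side) pairs: 6 x 2 = 12 outcomes.
die_faces = range(1, 7)
coin_sides = ["head", "tail"]
outcomes = list(product(die_faces, coin_sides))

# Joint probability by counting the single favourable outcome.
favourable = [o for o in outcomes if o == (1, "head")]
p_joint = Fraction(len(favourable), len(outcomes))

# Joint probability by multiplying the two independent probabilities.
p_product = Fraction(1, 6) * Fraction(1, 2)

print(p_joint, p_product)
```

Both routes give the same fraction, which is the property the likelihood calculation relies on.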
Now, focus on the loss function. A loss function assesses the performance of a model. Let’s say we measure the performance of our model by asking:
When we feed a dataset in, how likely is it for the model to label the samples the same as observed?
If the likelihood is 100%, then the model’s prediction for each sample matches the corresponding true label perfectly, and its performance is perfect! Given this motivation, we want to train our model such that the likelihood is maximized. Now, let’s see how to calculate this likelihood, and how it leads to the loss function you are asking about.
Let’s say we have this dataset, and let me know if you have questions about the last column of the table.
sample | true label / observation | model prediction | probability of predicting the observation |
---|---|---|---|
sample 1 | y_1 = 1 | f(x_1) = 0.7 | P_1 = 0.7 |
sample 2 | y_2 = 0 | f(x_2) = 0.2 | P_2 = 1 - 0.2 = 0.8 |
sample 3 | y_3 = 1 | f(x_3) = 0.8 | P_3 = 0.8 |
With the dataset, we ask the question again,
When we feed the above dataset in, how likely is it for the model to label sample 1 the same as observed (1), AND to label sample 2 the same as observed (0), AND to label sample 3 the same as observed (1)?
The answer is similar to the question about rolling a 1 AND tossing a head: we multiply the probabilities, and thus the likelihood is
L = P_1 \times P_2 \times P_3 = 0.7 \times 0.8 \times 0.8 = 0.7 \times (1-0.2) \times 0.8 ,
or, we can say that,
L = f(x_1) \times (1- f(x_2)) \times f(x_3)
I hope you are still with me here. And remember, we want to maximize this likelihood. For example, now our L is 0.7 \times 0.8 \times 0.8 because our model is not doing a perfect job. Then when is it doing a perfect job? It’s when L is 1 \times 1 \times 1, agree?
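To make the table concrete, here is a small sketch that recomputes the last column and the likelihood from the labels and predictions. The list names `y`, `f` and `p` are only illustrative choices.

```python
import math

# Values copied from the table: true labels y_i and model outputs f(x_i).
y = [1, 0, 1]
f = [0.7, 0.2, 0.8]

# Probability the model assigns to the label that was actually observed:
# f(x_i) when y_i = 1, and 1 - f(x_i) when y_i = 0.
p = [fi if yi == 1 else 1 - fi for yi, fi in zip(y, f)]

# The likelihood is the product of these per-sample probabilities.
likelihood = math.prod(p)

print(p)
print(likelihood)
```

A perfect model would have every entry of `p` equal to 1, and so a likelihood of 1.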
And it turns out that it’s computationally better to maximize the log version of this likelihood instead. They are equivalent because \log is an increasing function: if L is maximized, \log(L) is also maximized, and vice versa. Therefore, we now want to maximize the log-likelihood
\log(L) = \log(f(x_1)) + \log(1- f(x_2)) + \log(f(x_3))
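One reason the log version is friendlier to compute can be sketched as follows. The dataset here is made up for illustration (2000 samples, each predicted with probability 0.5) and is not from the table. With that many factors the raw product underflows to zero in 64-bit floating point, while the sum of logs stays a perfectly ordinary number.

```python
import math

# Hypothetical per-sample probabilities, chosen only to illustrate underflow.
probs = [0.5] * 2000

# Multiplying many numbers below 1 eventually underflows to exactly 0.0.
raw_likelihood = math.prod(probs)

# Summing the logs of the same numbers stays representable.
log_likelihood = sum(math.log(p) for p in probs)

print(raw_likelihood)
print(log_likelihood)
```

Once the product has collapsed to zero, it can no longer tell a good model from a bad one, whereas the log-likelihood still can.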
Now it starts to look like our loss function, and it will look more like it after we add something:
\log(L) = \\ y_1 \log(f(x_1)) + ( 1- y_1) \log(1- f(x_1)) + \\ y_2 \log(f(x_2)) + ( 1- y_2) \log(1- f(x_2)) + \\y_3 \log(f(x_3)) + ( 1- y_3) \log(1- f(x_3))
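Please examine the above closely and you will find that, even though I added so many terms, the value of \log(L) does not change: each y_i is either 0 or 1, so for every sample exactly one of the two terms survives and the other is multiplied by zero.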
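For anyone who prefers to check this numerically, here is a short sketch using the three samples from the table. It prints the two terms for each sample so you can see which one is switched off, then compares the long form with the short form. The variable names are illustrative.

```python
import math

y = [1, 0, 1]
f = [0.7, 0.2, 0.8]

# Short form: only the term matching each observed label.
short_form = math.log(f[0]) + math.log(1 - f[1]) + math.log(f[2])

# Long form: both terms for every sample, weighted by y_i and (1 - y_i).
long_form = 0.0
for yi, fi in zip(y, f):
    term_for_label_1 = yi * math.log(fi)            # zero when y_i = 0
    term_for_label_0 = (1 - yi) * math.log(1 - fi)  # zero when y_i = 1
    print(yi, term_for_label_1, term_for_label_0)
    long_form += term_for_label_1 + term_for_label_0

print(short_form, long_form)
```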
And if we use the summation sign to save some lines, we have this log-likelihood
\log(L) = \sum_{i=1}^{3} \left[ y_i \log(f(x_i)) + ( 1- y_i) \log(1- f(x_i)) \right]
One final step: in gradient descent, we want to minimize the loss, but how can we change from maximizing the log-likelihood to minimizing the loss? In other words, how can we turn our log-likelihood function into a loss function? The answer is to give it a negative sign.
Loss = -\log(L) = \sum_{i=1}^{3} \left[ -y_i \log(f(x_i)) - ( 1- y_i) \log(1- f(x_i)) \right]
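The formula above can be sketched in a few lines of plain Python. The function name `log_loss_total` is a hypothetical name chosen for this example, and the second set of predictions is invented just to show that better predictions give a smaller loss.

```python
import math

def log_loss_total(y, f):
    """Negative log-likelihood summed over all samples.

    Assumes labels are 0 or 1 and every prediction is strictly
    between 0 and 1 (math.log(0) would raise ValueError).
    """
    return sum(
        -yi * math.log(fi) - (1 - yi) * math.log(1 - fi)
        for yi, fi in zip(y, f)
    )

# The three samples from the table.
loss = log_loss_total([1, 0, 1], [0.7, 0.2, 0.8])

# Made-up predictions that sit closer to the true labels.
better = log_loss_total([1, 0, 1], [0.9, 0.1, 0.9])

print(loss, better)
```

Minimizing this quantity is the same as maximizing the likelihood computed earlier, because the loss is just the negative of its logarithm.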
I hope that after this example, you will see that our argument for constructing this loss function is very simple. We want to train our model such that the likelihood of the model predicting all samples as observed is maximized. Maximizing the likelihood is equivalent to maximizing the log-likelihood, which is in turn equivalent to minimizing the negative log-likelihood. That negative log-likelihood is the log loss you are asking about.
Cheers,
Raymond
Hello @Syed_Hamza_Tehseen
@rmwkwok has given a thorough treatment of how we came up with the log loss function.
A Back of the Envelope version of that would be:
When we have a binary classification problem, where the aim is to predict 0 or 1, we would like to maximally penalize the model if it predicts 1 when it should predict 0, and if it predicts 0 when it should predict 1.
We see that the log loss equation does that exactly for us:
Loss^{(i)} = {-y^{(i)} \log(f(x^{(i)})) - ( 1- y^{(i)}) \log(1- f(x^{(i)}))}
where y^{(i)} is the actual value and f(x^{(i)}) is the predicted value.
Case 1:
When y^{(i)} = 0 and f(x^{(i)}) = 0,
Loss^{(i)} = - \log(1- f(x^{(i)})) = -\log(1) = 0
Case 2:
When y^{(i)} = 0 and f(x^{(i)}) = 1,
Loss^{(i)} = - \log(1- f(x^{(i)})) = -\log(0) → \infty
Case 3:
When y^{(i)} = 1 and f(x^{(i)}) = 1,
Loss^{(i)} = - \log(f(x^{(i)})) = -\log(1) = 0
Case 4:
When y^{(i)} = 1 and f(x^{(i)}) = 0,
Loss^{(i)} = - \log(f(x^{(i)})) = -\log(0) → \infty
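As you can see here:
When the prediction equals the actual value, Loss^{(i)} = 0
When the prediction is the exact opposite of the actual value, Loss^{(i)} → \infty
For predictions in between, the loss is finite and grows as f(x^{(i)}) moves away from y^{(i)}.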
Cost, J = \frac {1} {m} \sum_{i=1}^{m} Loss^{(i)} is the sum of the losses over all m samples, divided by m, that is, the average loss per sample.
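Here is a rough sketch of the four cases and of the cost J. Since `math.log(0)` raises an error in Python, the sketch uses predictions that are very close to 0 or 1 instead of exactly 0 or 1. The offset `eps` and the function names are assumptions made for this illustration.

```python
import math

def loss_i(y, f):
    """Log loss for one sample with label y (0 or 1) and prediction f in (0, 1)."""
    return -y * math.log(f) - (1 - y) * math.log(1 - f)

def cost(ys, fs):
    """Cost J: the per-sample losses averaged over the m samples."""
    m = len(ys)
    return sum(loss_i(y, f) for y, f in zip(ys, fs)) / m

eps = 1e-6  # hypothetical small offset standing in for "exactly 0" or "exactly 1"

# (label, prediction) pairs mirroring Case 1 to Case 4.
cases = [(0, eps), (0, 1 - eps), (1, 1 - eps), (1, eps)]
for y, f in cases:
    print(y, f, loss_i(y, f))

# Cost for the three samples from the table earlier in the thread.
print(cost([1, 0, 1], [0.7, 0.2, 0.8]))
```

Shrinking `eps` makes the loss in the two matching cases shrink towards 0 and the loss in the two mismatched cases grow without bound.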
Without relying on formulas, and using a back of the napkin summation, what is the difference between cost and loss? It’s been said that “loss function is for assessing the performance of model,” but I thought that’s exactly what the cost function did, as well. I’m a little confused.
Hello @jabevan,
These courses tend to use loss for one sample, and cost as the total loss from all samples. However, please don’t take this as a standard for everywhere.
Cheers,
Raymond
@rmwkwok
Great explanation of likelihood, thank you very much.
Am I right in understanding your explanation to mean that the added y(i) in the formula for log(L) is basically a trick used to make the computation easier, similar to the added factor 1/2 in the cost function for linear regression, and not something that is mathematically necessary?
Hello @Fabian_Harder,
That’s a great question!
1/2 definitely saves time by saving us a multiplication operation.
The use of y replaces the conditional if ... then ... else ... with arithmetic operations. However, as for whether it makes computation easier, I think I am not the best person to give you a serious answer, because I believe it requires analysis or professional computer science knowledge that I do not have. The only thing I can tell you is that all the code written by others that I have read tends to use arithmetic operations instead of conditional statements, and I believe such a preference means something. Sorry I cannot tell you anything more concrete.
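To show what "replacing the conditional" looks like, here is a small sketch of the two forms side by side. It only checks that they return the same value and makes no claim about which one is faster. Both function names are hypothetical.

```python
import math

def loss_conditional(y, f):
    """Per-sample log loss written with an explicit if/else on the label."""
    if y == 1:
        return -math.log(f)
    else:
        return -math.log(1 - f)

def loss_arithmetic(y, f):
    """Same loss, with y and (1 - y) acting as on/off switches."""
    return -y * math.log(f) - (1 - y) * math.log(1 - f)

# The three samples from the table.
samples = [(1, 0.7), (0, 0.2), (1, 0.8)]
for y, f in samples:
    print(y, f, loss_conditional(y, f), loss_arithmetic(y, f))
```

The arithmetic form is a single expression with no branch, so the same formula can be applied to every sample without first looking at its label.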
Cheers,
Raymond
@rmwkwok
Thanks for your quick reply.
Another thing that I overlooked last time:
Why are we multiplying likelihoods when calculating L, but summing when calculating log(L)?
This is because logarithms convert multiplication into addition. In other words, logarithms simplify the math, e.g. log(xy) = log(x) + log(y).
Amazing, thank you, as well as @rmwkwok.