Cost Function and Loss Function

Syed_Hamza_Tehseen · July 14, 2022, 9:51am

Can anybody explain the cost function of the logistic regression algorithm to me? The loss function, to be more precise. I couldn’t get the idea of the loss function.

Umar_Faran_Mir · July 14, 2022, 11:42am

Hi Hamza, the cost function is a measure of the difference between the estimated outputs (calculated by the neural network) and the actual outputs. Consider a data set in which, for certain values x1, x2, x3 of set X, the output is Y. Your neural network calculates a value Y-hat from x1, x2, x3. If Y-hat equals Y, then your cost is zero and your network performance is great. If Y-hat is not equal to Y, you run Gradient Descent on your cost function to find the values of the parameters that give you the least cost. Least cost means that your neural network output Y-hat is as close to the actual value of Y as possible.
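To make the idea of running Gradient Descent on a cost function concrete, here is a minimal sketch. It assumes a logistic regression model with a single feature rather than a full neural network, and the toy data, the learning rate `lr`, and the number of steps are all made-up values for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(w, b, xs, ys):
    # Average log loss over all samples.
    total = 0.0
    for x, y in zip(xs, ys):
        f = sigmoid(w * x + b)
        total += -y * math.log(f) - (1 - y) * math.log(1 - f)
    return total / len(xs)

# Hypothetical toy data: one feature, label 1 for the larger x values.
xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 1, 1]

w, b, lr = 0.0, 0.0, 0.1  # starting parameters and an assumed learning rate
initial_cost = cost(w, b, xs, ys)

for _ in range(1000):
    # Gradients of the average log loss with respect to w and b.
    dw = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / len(xs)
    db = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / len(xs)
    w -= lr * dw
    b -= lr * db

final_cost = cost(w, b, xs, ys)
print(initial_cost, final_cost)  # the cost after training is lower than at the start
```

Each step moves `w` and `b` a little in the direction that reduces the cost, so Y-hat drifts closer to Y.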

rmwkwok · July 14, 2022, 12:39pm

To continue, we need to remember one thing. If you roll a die and toss a coin, the probability of getting a 1 AND a head is L =\frac{1}{6} \times \frac{1}{2} = \frac{1}{12}. Here we multiply the probabilities of these two independent events to get the joint probability of having a 1 and a head.

Now, focus on the loss function. A loss function assesses the performance of a model. Let’s say we measure the performance of our model by asking:

When we feed a dataset in, how likely is it for the model to label the samples the same as observed?

If the likelihood is 100%, then the model’s prediction for each sample matches the corresponding true label perfectly, and its performance is perfect! Given this motivation, we want to train our model such that the likelihood is maximized. Now, let’s see how to calculate this likelihood, and how it leads to the loss function you are asking about.

Let’s say we have this dataset, and let me know if you have questions about the last column of the table.

	true label / observation	model prediction	probability of predicting the observation
sample 1	y_1 = 1	f(x_1) = 0.7	P_1 = 0.7
sample 2	y_2 = 0	f(x_2) = 0.2	P_2 = 1 - 0.2 = 0.8
sample 3	y_3 = 1	f(x_3) = 0.8	P_3 = 0.8

With the dataset, we ask the question again,

When we feed the above dataset in, how likely is it for the model to label sample 1 the same as observed (1), AND to label sample 2 the same as observed (0), AND to label sample 3 the same as observed (1)?

The answer is similar to the question about rolling a 1 AND tossing a head: we multiply the probabilities, and thus the likelihood is

L = P_1 \times P_2 \times P_3 = 0.7 \times 0.8 \times 0.8 = 0.7 \times (1-0.2) \times 0.8 ,

or, we can say that,

L = f(x_1) \times (1- f(x_2)) \times f(x_3)

I hope you are still with me here. And remember, we want to maximize this likelihood. For example, now our L is 0.7 \times 0.8 \times 0.8 because our model is not doing a perfect job. Then when is it doing a perfect job? It’s when L is 1 \times 1 \times 1, agree?
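As a quick numerical check of the likelihood above, here is a small Python sketch. The variable names (`y`, `f`, `p`) are just illustrative choices, and the numbers are copied from the table.

```python
# True labels and model predictions f(x_i), copied from the table above.
y = [1, 0, 1]
f = [0.7, 0.2, 0.8]

# Probability that the model predicts the observed label for each sample:
# f(x_i) when y_i = 1, and 1 - f(x_i) when y_i = 0.
p = [fi if yi == 1 else 1 - fi for yi, fi in zip(y, f)]

# Likelihood: the product of the per-sample probabilities
# (the samples are treated as independent events).
likelihood = 1.0
for pi in p:
    likelihood *= pi

print(likelihood)  # approximately 0.448

# A perfect model would give each observed label probability 1.
perfect = 1.0 * 1.0 * 1.0
print(perfect)
```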

And it turns out that it’s computationally better to maximize the log version of this likelihood instead. They are equivalent because if L is maximized, \log(L) is also maximized, and vice versa. Therefore, we now want to maximize the log-likelihood

\log(L) = \log(f(x_1)) + \log(1- f(x_2)) + \log(f(x_3))

Now it starts to look like our loss function, and it will look more like it after we add something:

\log(L) = \\ y_1 \log(f(x_1)) + ( 1- y_1) \log(1- f(x_1)) + \\ y_2 \log(f(x_2)) + ( 1- y_2) \log(1- f(x_2)) + \\y_3 \log(f(x_3)) + ( 1- y_3) \log(1- f(x_3))

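Please examine the above closely and you will find that, even though I added so many terms, the value of \log(L) does not change. Each y_i is either 0 or 1, so in every pair one of the two terms is multiplied by 0 and vanishes, leaving exactly the term we had before.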
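If you prefer to check this numerically rather than by inspection, the sketch below compares the two forms of \log(L) using the values from the table. The names `short_form` and `expanded_form` are only labels for this example.

```python
import math

y = [1, 0, 1]
f = [0.7, 0.2, 0.8]

# Short form: only the three terms that actually contribute.
short_form = math.log(f[0]) + math.log(1 - f[1]) + math.log(f[2])

# Expanded form: every sample contributes both terms,
# but one of them is always multiplied by 0.
expanded_form = sum(
    yi * math.log(fi) + (1 - yi) * math.log(1 - fi)
    for yi, fi in zip(y, f)
)

print(short_form, expanded_form)  # the two values agree up to rounding
```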

And if we use the summation sign to save some lines, we have this log-likelihood

\log(L) = \sum_{i=1}^{3} \left[ y_i \log(f(x_i)) + ( 1- y_i) \log(1- f(x_i)) \right]

One final step: in gradient descent, we want to minimize the loss, but how can we change from maximizing the log-likelihood to minimizing the loss? In other words, how can we turn our log-likelihood function into a loss function? The answer is to give it a negative sign.

Loss = -\log(L) = \sum_{i=1}^{3} \left[ -y_i \log(f(x_i)) - ( 1- y_i) \log(1- f(x_i)) \right]
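The loss above can be written as a short Python function. This is only a sketch: the function name `log_loss` is my own choice, and the second set of predictions is a made-up example of a model that is more confident and still correct.

```python
import math

def log_loss(y, f):
    """Negative log-likelihood, summed (not averaged) over the samples."""
    return sum(
        -yi * math.log(fi) - (1 - yi) * math.log(1 - fi)
        for yi, fi in zip(y, f)
    )

y = [1, 0, 1]

# Predictions from the table above.
current = log_loss(y, [0.7, 0.2, 0.8])

# Hypothetical predictions from a better model.
better = log_loss(y, [0.9, 0.1, 0.95])

print(current, better)  # the better model has the smaller loss
```

A higher likelihood always corresponds to a lower loss, which is why minimizing this loss does the same job as maximizing the likelihood.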

I hope that after this example you will see that the argument for constructing this loss function is very simple. We want to train our model such that the likelihood for the model to predict all samples as observed is maximized. Maximizing the likelihood is equivalent to maximizing the log-likelihood, which is also equivalent to minimizing the negative log-likelihood, which is the log loss you are asking about.

Cheers,
Raymond

shanup · July 14, 2022, 5:18pm

Hello @Syed_Hamza_Tehseen

@rmwkwok has given a thorough treatment of how we came up with the log loss function.
A Back of the Envelope version of that would be:

When we have a binary classification problem, where the aim is to predict 0 or 1, we would like to maximally penalize the model if it predicts 1 when it should predict 0 and if it predicts 0 when it should predict 1

We see that the log loss equation does that exactly for us:

Loss^{(i)} = {-y^{(i)} \log(f(x^{(i)})) - ( 1- y^{(i)}) \log(1- f(x^{(i)}))}
where y^{(i)} is actual value and f(x^{(i)}) is predicted value

Case 1:
When y^{(i)} = 0 and f(x^{(i)}) = 0,
Loss^{(i)} = - \log(1- f(x^{(i)})) = -\log(1) = 0

Case 2:
When y^{(i)} = 0 and f(x^{(i)}) = 1,
Loss^{(i)} = - \log(1- f(x^{(i)})) = -\log(0) → \infty

Case 3:
When y^{(i)} = 1 and f(x^{(i)}) = 1,
Loss^{(i)} = - \log(f(x^{(i)})) = -\log(1) = 0

Case 4:
When y^{(i)} = 1 and f(x^{(i)}) = 0,
Loss^{(i)} = - \log(f(x^{(i)})) = -\log(0) → \infty

As you can see here:
When prediction = actual values, Loss^{(i)} = 0
When the prediction is the opposite of the actual value, Loss^{(i)} → \infty

Cost, J = \frac {1} {m} \sum_{i=1}^{m} Loss^{(i)} is the sum of the losses over all m samples, divided by m, that is, the average loss per sample.
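The four cases and the cost J can be tried out with the sketch below. One assumption to flag: \log(0) cannot be evaluated in code, so this example clips the prediction to a tiny margin `eps` away from 0 and 1. That clipping is my addition for the demonstration, and it is why the "infinite" losses show up as a large finite number instead.

```python
import math

def loss(y, f, eps=1e-15):
    # eps is an assumption of this sketch: it keeps log() finite
    # when the prediction f is exactly 0 or 1.
    f = min(max(f, eps), 1 - eps)
    return -y * math.log(f) - (1 - y) * math.log(1 - f)

def cost(ys, fs):
    # Average of the per-sample losses over all m samples.
    return sum(loss(y, f) for y, f in zip(ys, fs)) / len(ys)

# Cases 1 to 4 as (actual value, predicted value).
cases = [(0, 0.0), (0, 1.0), (1, 1.0), (1, 0.0)]
case_losses = [loss(y, f) for y, f in cases]
for (y, f), value in zip(cases, case_losses):
    print(f"y={y}, f={f}: loss={value:.3g}")

# Cost for three samples with less extreme predictions.
J = cost([1, 0, 1], [0.7, 0.2, 0.8])
print(J)
```

Cases 1 and 3 come out essentially zero, while cases 2 and 4 come out very large, matching the pattern described above.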

jabevan · January 6, 2023, 9:53pm

Without relying on formulas, and using a back of the napkin summation, what is the difference between cost and loss? It’s been said that “loss function is for assessing the performance of model,” but I thought that’s exactly what the cost function did, as well. I’m a little confused.

rmwkwok · January 7, 2023, 2:16am

Hello @jabevan,

These courses tend to use loss for one sample, and cost as the total loss from all samples. However, please don’t take this as a standard for everywhere.

Cheers,
Raymond

Fabian_Harder · September 16, 2023, 8:25pm

@rmwkwok
Great explanation of likelihood, thank you very much.
Am I right in understanding from your explanation that the added y_i in the formula for log(L) is basically a trick used to make the computation easier, similar to the added factor 1/2 in the cost function for linear regression, and not something that is mathematically necessary?

rmwkwok · September 16, 2023, 11:36pm

Hello @Fabian_Harder,

That’s a great question!

The 1/2 definitely saves time: it cancels the factor of 2 that appears when the squared error is differentiated, which saves us a multiplication operation.

The use of y replaces the conditional if ... then ... else ... with arithmetic operations. As for whether it makes computation easier, I think I am not the best person to give you a serious answer, because I believe it requires analysis or professional computer science knowledge that I have not gone through. The only thing I can tell you is that, of all the code implemented by others that I have read, it all tends to use arithmetic operations instead of conditional statements, and I believe such a preference means something. Sorry I cannot tell you anything more concrete.
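To show what replacing the conditional with arithmetic looks like, here is a small sketch with both versions side by side. The function names are made up for this comparison, and it says nothing about which one runs faster.

```python
import math

def loss_conditional(y, f):
    # The "if ... then ... else ..." version.
    if y == 1:
        return -math.log(f)
    else:
        return -math.log(1 - f)

def loss_arithmetic(y, f):
    # The version with y and (1 - y) acting as on/off switches.
    return -y * math.log(f) - (1 - y) * math.log(1 - f)

samples = [(1, 0.7), (0, 0.2), (1, 0.8)]
pairs = [(loss_conditional(y, f), loss_arithmetic(y, f)) for y, f in samples]
for a, b in pairs:
    print(a, b)  # both versions give the same loss for each sample
```

One thing I can say with some confidence is that the arithmetic version is a single expression, so it can be applied unchanged to whole arrays of labels and predictions in array libraries, whereas the conditional version handles one sample at a time.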

Cheers,
Raymond

Fabian_Harder · September 18, 2023, 7:38am

@rmwkwok
Thanks for your quick reply.
Another thing that I overlooked last time:
Why are we multiplying likelihoods when calculating L but summing when calculating log(L)?

lukmanaj · September 18, 2023, 10:19am

This is because logarithms convert multiplication into addition. In other words, logarithms simplify the math. e.g.
log(xy) = log(x) + log(y).
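Here is a small sketch of that identity using the three probabilities from earlier in the thread. The second half uses a made-up list of 2000 probabilities to illustrate one practical reason the log version is preferred: a long product of small numbers underflows to zero in floating point, while the sum of logs stays usable.

```python
import math

probs = [0.7, 0.8, 0.8]

# Multiply first, then take the log.
product = 1.0
for p in probs:
    product *= p
log_of_product = math.log(product)

# Take the logs first, then add them up.
sum_of_logs = sum(math.log(p) for p in probs)

print(log_of_product, sum_of_logs)  # equal up to rounding

# Hypothetical larger dataset: 2000 samples, each with probability 0.5.
many = [0.5] * 2000
tiny_product = 1.0
for p in many:
    tiny_product *= p
many_sum_of_logs = sum(math.log(p) for p in many)

print(tiny_product)      # underflows to 0.0
print(many_sum_of_logs)  # a finite negative number
```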

Fabian_Harder · September 20, 2023, 6:42pm

Amazing, thank you, as well as @rmwkwok.
