Derivation of the log loss function for classification

Hey,
Can somebody help me understand the derivation behind the log loss shown for the classification use case in Course 2, Week 3?

Hi @Neha_Prakash, to understand BCE (binary cross-entropy), it is best to analyze the equation term by term. The first term, y*log(y_hat), is relevant when the true label is 1 (in that case the second term is cancelled). The second term, (1-y)*log(1-y_hat), is relevant when the true label is 0 (in that case the first term is cancelled).

Look at the first term: when the label is 1 and the prediction is close to 1, the term is close to 0, so the loss is close to 0, which is what we want for a good prediction. The second term behaves the same way: when the label is 0 and the prediction is close to 0, we are close to the label's value and the loss is again close to 0. The further the prediction drifts from the true label, the larger the loss grows.

Hoping to have given you the intuition behind this equation. Have a nice evening (or day), depending on your location :slight_smile:
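
P.S. To put numbers on that intuition, here is a minimal sketch of my own (not from the course), assuming NumPy is available; the prediction values are just illustrative:

```python
import numpy as np

def bce(y, y_hat, eps=1e-12):
    """Binary cross-entropy for a single example.
    eps keeps log() away from 0 for numerical safety."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Label 1: loss is near 0 when the prediction is near 1, large when near 0.
print(bce(1, 0.99))  # ~0.01
print(bce(1, 0.01))  # ~4.6

# Label 0: loss is near 0 when the prediction is near 0, large when near 1.
print(bce(0, 0.01))  # ~0.01
print(bce(0, 0.99))  # ~4.6
```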

Hi @Degeye, thank you for such a great explanation of the intuition behind this. I would still love to see the derivation itself.

I recommend reading some books on information theory, which is where this concept comes from:

  • A Mathematical Theory of Communication (Claude Shannon)
  • Information Theory (James V Stone)

Have a nice day!
Stephane Degeye

That gives the intuition, but it cannot really be called a derivation, right?

Hey @Gary_Zhang2,

Check out chapter 5, section 5.5 of this free book by Jurafsky and Martin. It doesn’t take many steps to derive it, and the fundamental concept behind the starting step is discussed.

Cheers,
Raymond

thx so much, sir.

In that course, Dr. Serrano uses a coin toss as an analogy for deriving the loss function for binary classification. In the coin-toss case, we want to maximize the probability of the outcome we observed. If we toss the coin only once, that probability is p^y * (1-p)^(1-y). If the result is heads, then y = 1 and 1-y = 0, and the expression simplifies to p.

Now turn to the cost function for binary classification. The probability of a single example is (y_hat)^y * (1-y_hat)^(1-y). If the correct answer is y = 0, the first factor becomes 1 and only the "label 0" factor (1-y_hat) remains; likewise, if y = 1, only y_hat remains. Taking the log of that expression gives y*log(y_hat) + (1-y)*log(1-y_hat).

I believe the derivation goes like this.
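
Writing the same steps out for a whole training set (this is my own summary of the standard maximum-likelihood argument, with m denoting the number of examples):

```latex
\[
\begin{aligned}
P(y_i \mid \hat{y}_i) &= \hat{y}_i^{\,y_i}\,(1 - \hat{y}_i)^{\,1 - y_i}
  && \text{Bernoulli likelihood of one example} \\
L &= \prod_{i=1}^{m} \hat{y}_i^{\,y_i}\,(1 - \hat{y}_i)^{\,1 - y_i}
  && \text{likelihood of } m \text{ independent examples} \\
\log L &= \sum_{i=1}^{m} \big[\, y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \,\big]
  && \text{the log turns the product into a sum} \\
J &= -\tfrac{1}{m} \log L
  && \text{maximizing } L \text{ is minimizing } J \text{, the log loss}
\end{aligned}
\]
```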

Hello @Gary_Zhang2

I agree with you. It would also be nice to mention that we want to maximize the likelihood by having \hat{y} predict 1 when the label is 1, because this also translates to maximizing the probability you mentioned: p^y * (1-p)^(1-y). I think your "we want to maximize the probability of the outcome we observed" delivers a similar idea.

And maximizing that probability is, as you said, equivalent to maximizing the log probability, which is your y*log(y_hat) + (1-y)*log(1-y_hat), or to minimizing the negative log probability, which is exactly the log loss function.
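
As a quick sanity check of that equivalence, here is a small sketch of mine (not from the course), assuming NumPy and using made-up labels and predictions:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=5)           # hypothetical labels in {0, 1}
y_hat = rng.uniform(0.05, 0.95, size=5)  # hypothetical predicted probabilities

# Likelihood of the labels under the predictions (product over examples).
likelihood = np.prod(y_hat**y * (1 - y_hat)**(1 - y))

# Log loss (BCE), averaged over the examples.
log_loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# The two are linked by log_loss = -log(likelihood) / m,
# so minimizing the log loss maximizes the likelihood.
print(np.isclose(log_loss, -np.log(likelihood) / len(y)))  # True
```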

Cheers!

Raymond

thx for ur explanation, Raymond! I benefited a lot from it!

You are welcome, @Gary_Zhang2!