Derivation for log loss function in classification

Can somebody help me understand the derivation behind the log loss shown for the classification use case in Course 2, Week 3?

Hi @Neha_Prakash, to understand BCE (binary cross-entropy), it is best to analyze the equation term by term. The first term is relevant when the true label is 1 (in that case the second term cancels), and the second term is relevant when the true label is 0 (in that case the first term cancels). Going back to the first term: when the prediction is close to 1, that term is close to 0, so naturally the loss is close to 0. The second term behaves the same way: when the prediction is close to 0, we are close to the label 0, so again the loss is close to 0. I hope this gives you the intuition behind the equation. Have a nice evening (or day, depending on your location)!
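A minimal sketch of that term-by-term behaviour in Python (the function name `bce` is just for illustration, not from the course materials):

```python
import math

def bce(y, y_hat):
    """Binary cross-entropy for a single example.

    The first term is active when y = 1, the second when y = 0;
    the other term is multiplied by zero and drops out.
    """
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

print(bce(1, 0.99))  # label 1, confident correct prediction: loss near 0
print(bce(0, 0.01))  # label 0, confident correct prediction: loss near 0
print(bce(1, 0.01))  # label 1, confident wrong prediction: large loss
```

The confident wrong prediction is punished heavily because -log(y_hat) grows without bound as y_hat approaches 0.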

Hi @Degeye, thank you for such a great explanation of the intuition behind this. I would still love to see the derivation, though.


I recommend reading some books on information theory, which is where this concept comes from:

  • A Mathematical Theory of Communication (Claude Shannon)
  • Information Theory (James V Stone)

Have a nice day!
Stephane Degeye


That cannot really be called a derivation, though.

Hey @Gary_Zhang2,

Check out chapter 5, section 5.5 of this free book by Jurafsky and Martin. It doesn’t take many steps to derive it, and the fundamental concept behind the starting step is discussed.


Thanks so much, sir.

In that course, Dr. Serrano uses the coin toss as an analogy for deriving the loss function for binary classification. In coin tossing, we want to maximize the chance of winning. If we toss the coin only once, the probability is p^y * (1-p)^(1-y). If the result is heads, then y = 1 and 1 - y = 0, and the expression simplifies to p.

Now I want to turn to the cost function for binary classification. The probability should be (y_hat)^y * (1 - y_hat)^(1 - y). If y = 0 is the correct answer, the first factor becomes 1 and only the "label 0" factor remains. Taking the log of that expression gives y*log(y_hat) + (1-y)*log(1-y_hat).

I believe the derivation goes like this.
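A compact way to write the steps sketched above (using \hat{y} for the predicted probability):

$$
\begin{aligned}
P(y \mid \hat{y}) &= \hat{y}^{\,y}\,(1-\hat{y})^{\,1-y} \\
\log P(y \mid \hat{y}) &= y \log \hat{y} + (1-y)\log(1-\hat{y}) \\
\text{loss} &= -\log P(y \mid \hat{y}) = -\bigl[\,y \log \hat{y} + (1-y)\log(1-\hat{y})\,\bigr]
\end{aligned}
$$

The minus sign at the end turns "maximize the probability" into "minimize the loss," which is the form used in training.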


Hello @Gary_Zhang2

I agree with you. It would also be nice to mention that we want to maximize the likelihood by having \hat{y} predict 1 when the label is 1, because this also translates to maximizing the probability you mentioned: p^y * (1-p)^(1-y). I think your "we want to maximize the chance of winning" delivers a similar idea.

And maximizing the above probability is, as you said, equivalent to maximizing the log probability, which is your y*log(y_hat) + (1-y)*log(1-y_hat), or to minimizing the negative log probability, which is exactly the log loss function.
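As a quick numerical check of that equivalence (the labels and predicted probabilities below are made up for illustration), the negative log of the dataset likelihood equals the sum of per-example log losses:

```python
import math

labels = [1, 0, 1, 1, 0]            # example labels (illustrative only)
preds = [0.9, 0.2, 0.8, 0.6, 0.3]   # example predicted probabilities

# Likelihood of the whole dataset: product of per-example probabilities
likelihood = 1.0
for y, p in zip(labels, preds):
    likelihood *= p ** y * (1 - p) ** (1 - y)

# Negative log of that product ...
neg_log_likelihood = -math.log(likelihood)

# ... equals the sum of per-example log losses
log_loss = -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(labels, preds))

print(abs(neg_log_likelihood - log_loss) < 1e-9)  # True
```

So maximizing the likelihood, maximizing the log likelihood, and minimizing the log loss all pick out the same parameters; the log form is preferred because sums are easier to differentiate (and numerically stabler) than long products.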



Thanks for your explanation, Raymond! I got a lot out of it!


You are welcome, @Gary_Zhang2!