Clarification W2 Cost Function

Please, I need clarification for this part of the slide

Thanks in advance.

In order to calculate the total cost you are summing up two types of terms:

  1. If y = 1, the (1-y) factor becomes 0, so the only term that matters is -log(y_hat)
  2. If y = 0, the y factor becomes 0, so the only term that matters is -log(1-y_hat)

So for each sample you add either -log(y_hat) or -log(1-y_hat), depending on whether you are in case 1 or case 2. Then what's important is to remember the properties of the logarithm function.
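To make the two cases concrete, here is a minimal sketch in Python. I'm assuming the standard binary cross-entropy formula from the slide, and using the natural log; the function name `sample_loss` is just mine:

```python
import math

def sample_loss(y, y_hat):
    # Binary cross-entropy for one sample.
    # When y = 1 the second term vanishes; when y = 0 the first term vanishes.
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# Case 1: y = 1, only -log(y_hat) survives
print(sample_loss(1, 0.9))   # small loss, since y_hat is close to 1

# Case 2: y = 0, only -log(1 - y_hat) survives
print(sample_loss(0, 0.1))   # small loss, since y_hat is close to 0
```

Note that both calls give the same value here, because -log(0.9) appears in both cases.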

  1. log(1) = 0, and the logarithm of a value below 1 is negative, growing more negative as the value decreases; for example (in base 10) log(0.9) ≈ -0.04 and log(0.5) ≈ -0.30
  2. When only the -log(y_hat) term is present we want log(y_hat) to be large, so we also want y_hat to be large: the closer y_hat is to 1, the closer -log(y_hat) is to 0

Note that a "large" log(y_hat) means a small negative number; in the example above, -0.04 is larger than -0.30.

  3. When only the -log(1-y_hat) term is present we again want log(1-y_hat) to be large in the same sense, i.e. a small negative number. That means we want 1-y_hat as close to 1 as possible, and therefore y_hat small: the lower y_hat is, the closer 1-y_hat is to 1 and the closer log(1-y_hat) is to 0
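You can check this behaviour numerically; a quick sketch (natural log here, unlike the base-10 examples above, but the trend is the same):

```python
import math

# Case y = 1: as y_hat approaches 1, -log(y_hat) shrinks toward 0
for y_hat in (0.5, 0.9, 0.99):
    print(f"y=1: -log({y_hat}) = {-math.log(y_hat):.3f}")

# Case y = 0: as y_hat approaches 0, -log(1 - y_hat) shrinks toward 0
for y_hat in (0.5, 0.1, 0.01):
    print(f"y=0: -log(1 - {y_hat}) = {-math.log(1 - y_hat):.3f}")
```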

I don’t know if it is clearer now. Unfortunately Discourse doesn’t support LaTeX, so the mathematical part doesn’t look as good as it should.


I understand the second case now, but concerning the first case, how do we get the output close to 1?

Hi @MoHassan, I’m not sure I understand what you mean by "the output close to 1"; do you mean y_hat? We are trying to predict y_hat so that it matches the ground truth y, so we are generating values between 0 and 1.

When y is 0 we want y_hat as close to 0 as possible and when y is 1 we want y_hat as close as possible to 1.

Yeah, but in both cases I see that we are outputting y_hat close to 0, so the prediction can’t be 1 in the first case.

In the first case, we want y_hat close to 1, so -log(y_hat) is close to 0.

In the second case, we want y_hat close to 0, so (1 - y_hat) is as close to 1 as possible and then -log(1-y_hat) is close to 0.

Do we want just y_hat to be 0 or 1, or the final output to be 0 or 1?
I understand that it’s the final output we want.

Leaving the cost function aside, the main concept is that ideally we want y_hat to be equal to the real y. If that were true for all y_i, then we would have a total cost of 0.

When you apply that concept to the formula of the cost you will see that the closer y_hat is to y, the lower the cost for that particular sample. If you manage that for all samples, you minimize the overall cost.
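Putting it together, here is a sketch of the total (average) cost over all samples; the function name `total_cost` and the example numbers are mine, not from the slide:

```python
import math

def total_cost(ys, y_hats):
    # Average of the per-sample losses -[y*log(y_hat) + (1-y)*log(1-y_hat)]
    return -sum(y * math.log(yh) + (1 - y) * math.log(1 - yh)
                for y, yh in zip(ys, y_hats)) / len(ys)

ys = [1, 0, 1]
close = total_cost(ys, [0.95, 0.05, 0.90])  # y_hat close to y -> low cost
far   = total_cost(ys, [0.60, 0.40, 0.55])  # y_hat far from y -> higher cost
print(close, far)
```

The closer every y_hat gets to its y, the closer the total cost gets to 0, which is exactly what gradient descent is pushing toward.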

Okay got it,
Thanks so much.