Please, I need clarification for this part of the slide

Thanks in advance.

In order to calculate the total cost you are summing up two types of terms:

- If **y = 1**, the (1-y) factor becomes 0, so the only term that matters is `-log(y_hat)`.

- If **y = 0**, the only term we care about is `-log(1-y_hat)`.
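To make the two cases concrete, here is a minimal Python sketch of the per-sample loss. It assumes the slide uses the usual cross-entropy form with the natural logarithm and a `y_hat` strictly between 0 and 1; the name `sample_loss` is only an illustrative choice.

```python
import math

def sample_loss(y, y_hat):
    """Per-sample loss: -(y*log(y_hat) + (1-y)*log(1-y_hat)).

    Assumes y is 0 or 1 and 0 < y_hat < 1 (natural log).
    """
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# y = 1: the (1 - y) factor is 0, so only -log(y_hat) is left
loss_pos = sample_loss(1, 0.9)

# y = 0: the y factor is 0, so only -log(1 - y_hat) is left
loss_neg = sample_loss(0, 0.9)

print(loss_pos, loss_neg)
```

With the same prediction of 0.9, the loss is small when the true label is 1 and much larger when the true label is 0.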

So you are summing either `-log(y_hat)` or `-log(1-y_hat)`, depending on whether you have case 1 or 2. What matters then is to remember the properties of the logarithm function.

`log(1) = 0`, and the logarithm of a value lower than 1 is negative, increasingly so the further the value is from 1. For example, in base 10, log(0.9) ≈ -0.05 and log(0.5) ≈ -0.30 (with the natural log the values are about -0.11 and -0.69, and the same trend holds).

- When we only have the `-log(y_hat)` term we want `log(y_hat)` to be large, so we also want `y_hat` to be large, because the closer that value is to 1, the closer `-log(y_hat)` is to 0.

Note that "`log(y_hat)` large" means that we get a small negative number: in the example above, -0.05 is larger than -0.30.

- When we only have the term `-log(1-y_hat)`, again we want `log(1-y_hat)` to be large in the same sense, i.e. a small negative number. That implies we want `1-y_hat` as high as possible and, in consequence, `y_hat` small: the lower `y_hat` is, the closer `1-y_hat` is to 1 and the closer `log(1-y_hat)` is to 0.
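A quick way to see both trends is to print the two terms for a few predictions. This is only a sketch: the `y_hats` values are made up for illustration and the natural log is assumed.

```python
import math

# Hypothetical predictions, from low to high
y_hats = [0.1, 0.5, 0.9, 0.99]

# Case y = 1: the loss is -log(y_hat); it shrinks as y_hat approaches 1
loss_when_y1 = [-math.log(p) for p in y_hats]

# Case y = 0: the loss is -log(1 - y_hat); it grows as y_hat approaches 1
loss_when_y0 = [-math.log(1 - p) for p in y_hats]

for p, l1, l0 in zip(y_hats, loss_when_y1, loss_when_y0):
    print(f"y_hat={p:.2f}  -log(y_hat)={l1:.3f}  -log(1-y_hat)={l0:.3f}")
```

Reading the rows top to bottom, the first column of losses goes down while the second goes up, which is exactly why we want `y_hat` near 1 in case 1 and near 0 in case 2.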

I don’t know if it is clearer now. Unfortunately Discourse doesn’t support LaTeX, so the mathematical part doesn’t look as good as it should.


I understand the second case now, but concerning the first case, how do we get the output close to 1?

Hi @MoHassan, I’m not sure I understand what you mean by the *output close to 1*. Do you mean **y_hat**? We are trying to predict **y_hat** so that it matches the ground truth **y**, so we are generating values between 0 and 1.

When **y** is 0 we want **y_hat** as close to 0 as possible and when **y** is 1 we want **y_hat** as close as possible to 1.

Yeah, but in both cases I see that we are outputting **y_hat** close to 0, so the prediction can’t be 1 in the first case.

In the first case, we want **y_hat** close to **1**, so `-log(y_hat)` is close to 0.

In the second case, we want **y_hat** close to **0**, so **(1 - y_hat)** is as close to 1 as possible and then `-log(1-y_hat)` is close to 0.

Do we want just **y_hat** to be 0 or 1, or the final output to be 0 or 1?

I understand that we want the final output.

Leaving aside the cost function the main concept is that ideally we want **y_hat** to be equal to the real **y**. If that was true for all y_i then we would manage to have a total cost of 0.

When you apply that concept to the formula of the cost, you will see that the closer **y_hat** is to **y**, the lower the cost will be for that particular sample. If you manage that for all samples, you are minimizing the overall cost.
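As a rough illustration of that last point, the sketch below averages the per-sample losses for two made-up sets of predictions. The labels, the predictions, and the function name `total_cost` are all hypothetical, and the natural log is assumed.

```python
import math

def total_cost(ys, y_hats):
    """Average of the per-sample losses (natural log assumed)."""
    losses = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for y, p in zip(ys, y_hats)
    ]
    return sum(losses) / len(losses)

ys = [1, 0, 1, 0]                  # hypothetical ground-truth labels
far = [0.6, 0.4, 0.55, 0.45]       # predictions only loosely matching y
near = [0.99, 0.01, 0.98, 0.02]    # predictions close to y

cost_far = total_cost(ys, far)
cost_near = total_cost(ys, near)

print(cost_far, cost_near)
```

The cost for `near` is much lower than for `far`, and it tends to 0 as every **y_hat** approaches its **y**. Note that this sketch would fail for a **y_hat** of exactly 0 or 1, because `math.log(0)` is undefined.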

Okay got it,

Thanks so much.