Correct me if I am wrong. In the video, it is explained that the loss function should be minimized. It is also said that when y = 1, the loss is -log(y_hat). To minimize this, we want y_hat to be as large as possible (closest to 1), because the negative sign makes the loss as small as possible. But my question is: doesn't the negative sign simply represent the direction of the difference between the actual and predicted values? So if y_hat is close to 1, our loss will be approximately -1. Isn't that bad? We want our loss to be 0, not -1, and -1 < 0.
Let's say y_hat is 0.99 and y is 1. What is the value of log(0.99)? It's a negative value, so the cost will be -(negative value) = positive value.
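To make that concrete, here is a quick check in Python (just an illustrative snippet, not course code):

```python
import math

y = 1
y_hat = 0.99

# log(0.99) is a small negative number
print(math.log(y_hat))   # ≈ -0.01005

# negating it gives a small positive loss, close to 0
loss = -math.log(y_hat)
print(loss)              # ≈ 0.01005
```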
I don’t think this statement is true. The negative sign does not represent the direction.
Right! The point is that our \hat{y} values are between 0 and 1. Take a look at the graph of the natural log function and you’ll see that it is negative for the domain (0, 1). The range of the function on that domain is (-\infty, 0). So we need to multiply by -1 to get a positive value for the cross entropy loss.
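If it helps to see the numbers, here is a small sketch in Python (my own illustration, assuming y = 1) that samples \hat{y} across (0, 1):

```python
import math

# For y = 1 the loss is -log(y_hat); log is negative on (0, 1),
# so negating it always gives a positive loss.
for y_hat in [0.01, 0.1, 0.5, 0.9, 0.99, 0.999]:
    loss = -math.log(y_hat)
    print(f"y_hat = {y_hat:>5}: loss = {loss:.4f}")

# The loss shrinks toward 0 as y_hat approaches 1,
# and blows up toward infinity as y_hat approaches 0.
```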
Here’s a nice explanation of cross entropy loss from Raymond.
It probably comes from learning physics.
In physics this intuition actually holds: the sign of a quantity such as force, velocity, or acceleration represents the direction in which it pulls, pushes, or accelerates.
But the equivalent of “force” in this instance is the derivative of the cost, right? In that case, the direction is indeed expressed by the sign. But the cost itself is always positive.
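For example, with y = 1 the per-example loss is -\log(\hat{y}), and its derivative with respect to \hat{y} is -1/\hat{y}, which is always negative, so the update pushes \hat{y} upward toward 1. A tiny numerical check (my own sketch, not from the course):

```python
import math

def loss(y_hat):
    # loss for a single example with y = 1
    return -math.log(y_hat)

y_hat = 0.6
eps = 1e-6

# central-difference estimate of the derivative vs. the analytic value -1/y_hat
numeric_grad = (loss(y_hat + eps) - loss(y_hat - eps)) / (2 * eps)
analytic_grad = -1.0 / y_hat
print(numeric_grad, analytic_grad)  # both ≈ -1.6667: negative, so y_hat should increase
```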
Let's try to break this down. The loss function is computed on a single data point with \hat{y}, the predicted outcome, and y, the actual outcome. Suppose the model fits a data point perfectly: if y = 1 (y can only be 1 or 0) and \hat{y} is also approximately 1, then the loss function -(y\log(\hat{y}) + (1-y)\log(1-\hat{y})), also known as the binary cross entropy, is approximately 0. The same is true if both y and \hat{y} are 0.
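Here is a minimal sketch of that check in code (the small eps clamp to avoid log(0) is my own addition):

```python
import math

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # clip y_hat away from exactly 0 or 1 so log() stays finite
    y_hat = min(max(y_hat, eps), 1 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

print(binary_cross_entropy(1, 0.999))  # ≈ 0.001, near-perfect prediction
print(binary_cross_entropy(0, 0.001))  # ≈ 0.001, near-perfect prediction
print(binary_cross_entropy(1, 0.001))  # ≈ 6.9, badly wrong prediction
```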
So, a value of 0 is the ideal value that we want for the loss function, correct? We would want to tune our parameters so that they yield a loss of 0, or close to 0.
Yes, that sounds correct.
Ok, I understand it now. I plotted the graph of log and then it became apparent to me. For a moment I forgot that the value of y_hat cannot exceed 1, and when it reaches 1 (in the case that y is also 1), the loss function becomes 0. Thanks, community!