For the loss function, Andrew says we want “log y_hat” to be large. My expectation is that we want log y_hat to be small, meaning we want y_hat to be large. Computing log 0.999 (y_hat is large) gives about -0.001, which I believe is our objective: we want the loss function to be as close to zero as possible. I do see why we want y_hat to be large, but I can't get my head around why we want log y_hat to be large.
Similarly, my expectation is we want “log(1-y_hat)” to be small, not large.
If y_hat is large, then log(y_hat) is large and -log(y_hat) (Note the negative sign) is small.
For example,
if y_hat = 0.9 (large), then log(y_hat) = -0.1 (large) and -log(y_hat) = 0.1 (small);
if y_hat = 0.1 (small), then log(y_hat) = -2.3 (small) and -log(y_hat) = 2.3 (large);
So, if y = 1, we want y_hat to be large, which is the same as wanting log(y_hat) to be large. Try working through the details for the case y = 0 yourself, and hopefully you will see that we want log(1-y_hat) to be large as well.
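If it helps to check the numbers, here is a minimal Python sketch of both cases. It assumes the loss under discussion is the usual logistic regression loss with the natural log; the function name `logistic_loss` is just made up for illustration.

```python
import math

def logistic_loss(y, y_hat):
    """Loss for a single example: -(y*log(y_hat) + (1-y)*log(1-y_hat)).

    Assumes 0 < y_hat < 1 and y is 0 or 1; uses the natural log.
    """
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# Print the loss for a confident-high and a confident-low prediction,
# once for each possible label.
for y in (1, 0):
    for y_hat in (0.9, 0.1):
        print(f"y={y}, y_hat={y_hat}: loss={logistic_loss(y, y_hat):.3f}")
```

When y = 1 the loss is smaller for y_hat = 0.9 than for y_hat = 0.1, and when y = 0 it is the other way around, which is the y = 0 case mentioned above.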
The way Prof Ng expresses this is perhaps a bit confusing. The thing to keep in mind is that log(\hat{y}) and log(1 - \hat{y}) are non-positive numbers (<= 0). So by making them large, he means pushing them further to the right on the number line. For a negative number that means closer to 0, right?
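For reference, and assuming the loss being discussed is the standard logistic regression loss, the sign bookkeeping can be written out like this:

$$
L(\hat{y}, y) = -\big( y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \big)
$$

For 0 < \hat{y} < 1 both logs are negative, so the leading minus sign makes the loss non-negative. With y = 1 only the first term survives:

$$
L(\hat{y}, 1) = -\log \hat{y}
$$

Making log \hat{y} larger (closer to 0 from below) therefore makes the loss smaller. With y = 0 the same argument applies to log(1 - \hat{y}).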
if y_hat = 0.9 (large), then log(y_hat) = -0.1 (large) and -log(y_hat) = 0.1 (small)
I got stuck at the above explanation and sort of gave up on understanding what is being explained, specifically the part where -0.1 is large and +0.1 is small. On a number line it would be the reverse.
Actually, one must understand that y_hat is between 0 and 1, meaning it is a fraction.
So the log of any such fraction is always negative. Think of it this way: what number x makes 10^(x) = 0.1? It has to be a negative number.
After understanding that the log of any fraction is negative, it's a case of comparing negative numbers against other negative numbers, not against positives. So log(0.9) = -0.1 is large relative to what we would get if y_hat were 0.1, namely log(0.1) = -2.3. In summary, -0.1 is large relative to -2.3, similar to Paul's explanation with the number line.
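A quick way to see the ordering for yourself is the small Python sketch below. The list of y_hat values is arbitrary and chosen only for illustration, and it uses the natural log, which is what gives -0.1 and -2.3 above.

```python
import math

# A few example predictions, from small to large (all strictly between 0 and 1).
y_hats = [0.1, 0.5, 0.9, 0.999]

# Natural log of each prediction.
logs = [math.log(p) for p in y_hats]

for p, v in zip(y_hats, logs):
    print(f"y_hat={p}: log(y_hat)={v:.4f}")

# Every log is negative, and the logs increase as y_hat increases,
# so a "large" log here means the one closest to 0.
print(all(v < 0 for v in logs))
print(logs == sorted(logs))
```

Both checks come out true: as y_hat moves toward 1, log(y_hat) moves to the right on the number line, toward 0.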
I am not sure if there is an easier way to explain it than the way Andrew did, but it needs a twisted way of thinking to get it.
I would have said: we want y_hat large, so that we get log y_hat large, noting that log y_hat is always negative [note the order I put them in].