Hi @Ali_Ghadimi. There appears to be a dropped minus sign on that first slide that you show. On the slide, Prof Ng is showing that the cross-entropy loss can be derived from the maximum likelihood principle: given the data (assumed to be drawn from the “correct” distribution), what parameters are most likely to explain/predict the data? In other words, which parameters maximize the (log) likelihood function?
The underlying distribution in the (log) likelihood function (at the top) is the Bernoulli (Binomial) distribution – the basic distribution for a weighted coin toss. In the AI disciplines, the problem is typically couched in terms of a loss function: which parameters are most likely to minimize the loss associated with deviations from the actual data? Hence, the objective function is multiplied by -1 to turn it into a minimization problem.
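For concreteness, here is a sketch of the chain of steps being shown on the slide, written out explicitly. I’m using the course’s \hat{y}^{(i)}, y^{(i)} notation and including the usual \frac{1}{m} averaging in the cost J; the key move is just the sign flip at the end:

$$
\begin{aligned}
P(y \mid x) &= \hat{y}^{\,y}\,(1-\hat{y})^{1-y} &&\text{(Bernoulli)}\\
\log \prod_{i=1}^{m} P\big(y^{(i)} \mid x^{(i)}\big) &= \sum_{i=1}^{m}\Big[\, y^{(i)}\log \hat{y}^{(i)} + \big(1-y^{(i)}\big)\log\big(1-\hat{y}^{(i)}\big)\Big] &&\text{(log likelihood, to maximize)}\\
J &= -\frac{1}{m}\sum_{i=1}^{m}\Big[\, y^{(i)}\log \hat{y}^{(i)} + \big(1-y^{(i)}\big)\log\big(1-\hat{y}^{(i)}\big)\Big] &&\text{(cost, to minimize)}
\end{aligned}
$$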
Postscript: To my mind the script-L function \mathcal{L} is suggestive of “loss” and so should include the minus sign. Prof Ng goes another way, but drops the minus sign in the last line (assuming that this snapshot was not taken a second before the minus sign appears), as if he too fell prey to the ambiguity. Your confusion is quite understandable. Paraphrasing Laplace (I think), “half the battle of mathematics is the invention of a good notation.”
Right! The underlying point here is that the loss involves the logarithm of numbers between 0 and 1. Those logarithms are negative, so we multiply by -1 to get a positive cost value.
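Here is a quick numerical illustration of that point (just a NumPy sketch, not code from the course):

```python
import numpy as np

# Predicted probabilities for a few examples whose true label is y = 1.
y_hat = np.array([0.9, 0.5, 0.1])

# The log of a number in (0, 1) is negative ...
print(np.log(y_hat))   # [-0.105 -0.693 -2.303]

# ... so we flip the sign to get a positive loss that grows
# as the prediction moves away from the true label.
print(-np.log(y_hat))  # [ 0.105  0.693  2.303]
```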
Yes, the slides are a bit confusing. But you just have to keep in mind what I said in my previous reply: the logarithms are all negative and we need the cost to be positive. So it’s just a question of where you put the minus sign: on the individual terms, inside the parens inside the sum, outside the summation (factored out), or incorporated into the definition of L(\hat{y}^{(i)}, y^{(i)}).
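To make the “where does the minus sign live” point concrete, here is a small sketch (again NumPy, not course code, with made-up labels and predictions) showing that two of those placements compute exactly the same cost:

```python
import numpy as np

y     = np.array([1, 0, 1, 1])          # true labels
y_hat = np.array([0.9, 0.2, 0.7, 0.6])  # predicted probabilities
m = y.size

# Minus sign baked into the per-example loss L(y_hat, y).
L = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
cost_1 = np.sum(L) / m

# Same thing with the minus sign factored outside the summation.
cost_2 = -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)) / m

print(np.isclose(cost_1, cost_2))  # True -- it's just algebra, same number
```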