In the lecture, the cost function of multi-class classification is defined as below: $-\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{C}y_{j}^{(i)}\log(\hat{y}_{j}^{(i)})$

Why is it not defined as $-\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{C}[y_{j}^{(i)}\log(\hat{y}_{j}^{(i)})+(1-y_{j}^{(i)})\log(1-\hat{y}_{j}^{(i)})]$
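To make the difference between the two candidate costs concrete, here is a small sketch in plain Python (the label and prediction values are my own illustrative choices) computing both for a single one-hot example:

```python
import math

# One-hot label and a softmax-style prediction for C = 3 classes
y = [0.0, 1.0, 0.0]
y_hat = [0.1, 0.7, 0.2]

# Lecture's cost: only the correct class contributes
cost_lecture = -sum(yj * math.log(pj) for yj, pj in zip(y, y_hat))

# Proposed (binary-style) cost: every unit contributes a term
cost_proposed = -sum(
    yj * math.log(pj) + (1 - yj) * math.log(1 - pj)
    for yj, pj in zip(y, y_hat)
)

print(cost_lecture)   # -log(0.7) ≈ 0.3567
print(cost_proposed)  # -log(0.7) - log(0.9) - log(0.8) ≈ 0.6852
```

The proposed cost adds extra terms penalizing the incorrect-class outputs, which is exactly what the discussion below is about.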

(by the way, my formulas are not displayed correctly; what is wrong?)

The proposed formula is similar to the binary classification cost function; the difference is that in the binary case the two classes are encoded in just one binary label $y^{(i)}$, so the term associated with $(1-y^{(i)})$ plays the part of class 2.
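The point about $(1-y^{(i)})$ encoding class 2 can be checked numerically; this sketch (example values are mine) shows that the binary cost is exactly the multi-class sum with $C=2$ and one-hot labels:

```python
import math

# Binary case: a single label y in {0, 1}; (1 - y) implicitly encodes class 2
y, y_hat = 1, 0.8
binary_ce = -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# Same cost written as the multi-class sum with C = 2 one-hot labels
y_onehot = [y, 1 - y]            # [1, 0]
y_hat_vec = [y_hat, 1 - y_hat]   # two outputs that sum to 1, like a softmax
multi_ce = -sum(yj * math.log(pj) for yj, pj in zip(y_onehot, y_hat_vec))

print(binary_ce, multi_ce)  # both equal -log(0.8) ≈ 0.2231
```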

The summation over C is an economical way to express all the terms associated with the C classes, but for any given example only one of them (the class the example belongs to) actually contributes to the sum, just as in the binary case.

If we use the formula you propose, then for every example you will add the term associated with the correct class, $y_{j}^{(i)}\log(\hat{y}_{j}^{(i)})$, once, and the term $(1-y_{j}^{(i)})\log(1-\hat{y}_{j}^{(i)})$ a total of $C-1$ times.

Regarding the display of the formulas, you just have to enclose them in $ symbols as below:

The formula proposed by me was not invented by me; it was proposed in Andrew's Machine Learning course, week 5, in the Cost Function video. Below is the screenshot of the video

Sorry if I misunderstood your question; I would need to review the ML course to check it in its proper context.

The output consists of C units, but those are the $\hat{y}_j^{(i)}$, while the $y_j^{(i)}$ are the one-hot labels, so only the one for the correct class ends up contributing to the sum.

In a sense I understand this as the cost function only caring that the predicted $\hat{y}$ for the correct class is close to 1, without caring about the predictions for the other classes (remember that, coming from a softmax, all the outputs add up to 1, so if the output for the correct class is close to 1 then the other units must be close to 0).
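The "outputs add up to 1" property is easy to verify with a minimal softmax sketch (plain Python, logit values are my own example):

```python
import math

def softmax(z):
    # Subtract the max logit for numerical stability before exponentiating
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

p = softmax([4.0, 1.0, 0.5])
print(sum(p))  # ≈ 1.0: the outputs always sum to one
print(p)       # the largest logit dominates; the other units are pushed toward 0
```

Because of this constraint, driving the correct-class output toward 1 necessarily drives every other output toward 0.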

That's right; however, the "cost contributed by them" depends entirely on how you define the cost function, which in this case, being a classification problem, only adds the output for the correct class, zeroing out the "contribution" from the incorrect classes.

The intuition I was trying to convey is that, for a classification problem, the cost function focuses on how good you are at approximating the correct class, and by virtue of being good at that, given the nature of the softmax, you end up being good at recognizing the incorrect classes as well.

In the first one, the hypothesis (a softmax) constrains the sum of the output units to 1, so the cost of the correct class already contains the information/cost from the incorrect classes.

While in the second one's hypothesis, the activation function is a sigmoid, and each unit is independent of the others. Thus we have to add up the cost of each output unit one by one.
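The independent-sigmoid case can be sketched the same way (example values are mine): each unit gets its own binary cross-entropy term, and nothing forces the outputs to sum to 1:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Independent sigmoid units: the outputs need not sum to 1
z = [2.0, -1.0, 0.5]
a = [sigmoid(v) for v in z]
y = [1.0, 0.0, 0.0]  # one-hot here, but each unit is scored separately

# Per-unit binary cross-entropy terms, added up one by one
cost = -sum(
    yj * math.log(aj) + (1 - yj) * math.log(1 - aj)
    for yj, aj in zip(y, a)
)
print(sum(a))  # can exceed 1: the units are independent
print(cost)
```

Since no unit's output constrains the others here, every unit must carry its own cost term, which is exactly why the binary-style sum is needed in this setting.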