Cost function of multi-class classification

In the lecture, the cost function of multi-class classification is defined as below:
-\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{C}y_{j}^{(i)}log(\hat{y}_{j}^{(i)})

Why is it not defined as
-\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{C}[y_{j}^{(i)}log(\hat{y}_{j}^{(i)})+(1-y_{j}^{(i)})log(1-\hat{y}_{j}^{(i)})]

(By the way, my formulas below are not displayed correctly. What is wrong?)

-\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{C}y_{j}^{(i)}log(\hat{y}_{j}^{(i)})

-\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{C}[y_{j}^{(i)}log(\hat{y}_{j}^{(i)})+(1-y_{j}^{(i)})log(1-\hat{y}_{j}^{(i)})]

Hi @mc04xkf,

Notice that if we expand the summation over the C classes, we can express the formula defined in the lecture as:

-\frac{1}{m}\sum_{i=1}^{m}[ y_{1}^{(i)}log(\hat{y}_{1}^{(i)}) + y_{2}^{(i)}log(\hat{y}_{2}^{(i)}) + ... + y_{C}^{(i)}log(\hat{y}_{C}^{(i)}) ]

which is similar to the binary classification cost function, the difference being that in the binary case the two classes are encoded in just one binary label y^{(i)}, so the term associated with (1-y^{(i)}) is playing the part of class 2.
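
To spell out that correspondence, here is a short worked step (the relabeling y_{1}, y_{2} is my own notation, not from the lecture). Setting C = 2 and writing y_{1}^{(i)} = y^{(i)}, y_{2}^{(i)} = 1-y^{(i)}, \hat{y}_{1}^{(i)} = \hat{y}^{(i)} and \hat{y}_{2}^{(i)} = 1-\hat{y}^{(i)}, the inner sum becomes:

\sum_{j=1}^{2}y_{j}^{(i)}log(\hat{y}_{j}^{(i)}) = y^{(i)}log(\hat{y}^{(i)})+(1-y^{(i)})log(1-\hat{y}^{(i)})

so the two-class version of the lecture formula has the same bracket as the binary cost function.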

The summation over C is an economical way to express all the terms associated with the C classes, but for any example only one of them (the one for the class to which the example belongs) actually contributes to the sum, as in the case of binary classification.

If we use the formula you propose, then for every example you add the term y_{j}^{(i)}log(\hat{y}_{j}^{(i)}) once, for the correct class, and the term (1-y_{j}^{(i)})log(1-\hat{y}_{j}^{(i)}) C-1 times, once for each incorrect class.
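
To make the counting concrete, here is a minimal sketch in plain Python that evaluates both formulas on a single example with C = 3 and a one-hot label. The function names and the example numbers are made up for illustration and are not taken from the lecture.

```python
import math

def categorical_cross_entropy(y, y_hat):
    """Lecture formula for one example: -sum_j y_j * log(y_hat_j)."""
    return -sum(yj * math.log(pj) for yj, pj in zip(y, y_hat))

def per_unit_binary_cross_entropy(y, y_hat):
    """Proposed formula for one example: every unit adds a binary cross-entropy term."""
    return -sum(
        yj * math.log(pj) + (1 - yj) * math.log(1 - pj)
        for yj, pj in zip(y, y_hat)
    )

y = [0, 1, 0]            # one-hot label, the correct class is j = 2
y_hat = [0.2, 0.7, 0.1]  # hypothetical softmax output, sums to 1

loss_lecture = categorical_cross_entropy(y, y_hat)
loss_proposed = per_unit_binary_cross_entropy(y, y_hat)

print(loss_lecture)   # only the -log(0.7) term survives
print(loss_proposed)  # -log(0.7) - log(1 - 0.2) - log(1 - 0.1)
```

With a one-hot label, the first function keeps a single term, while the second keeps one term for the correct class plus C-1 = 2 terms for the incorrect classes.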

Regarding the display of the formulas, you just have to enclose them in $ symbols, as below:

$ formula here $.

Hi @kampamocha,

The formula I proposed was not invented by me; it was presented in Andrew’s Machine Learning course, week 5, in the video on the cost function. Below is the screenshot from the video.

My understanding is that the output consists of C units, and the cost is contributed by all C units, not just the one with the largest y_{j}^{(i)}.

Hi @mc04xkf,

Sorry if I misunderstood your question, I would need to review the ML course to check it in proper context.

The output consists of C units, but those are the \hat{y}_j^{(i)}, while the y_j^{(i)} are the one-hot labels, so only the one from the correct class ends up contributing to the sum.

In a sense, I understand this as the cost function only caring that the predicted \hat{y} for the correct class is close to 1, not about the predictions for the other classes (remember that, coming from a softmax, all the outputs add up to 1, so if the output for the correct class is close to 1 then the other units must be close to 0).
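
A small sketch of that softmax behaviour, in plain Python. The logits are hypothetical values chosen for illustration; class 0 plays the role of the correct class, and only its logit is increased.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

rows = []
# Hypothetical logits: class 0 is the correct class and its logit grows.
for correct_logit in (1.0, 3.0, 6.0):
    y_hat = softmax([correct_logit, 0.5, -0.5])
    loss = -math.log(y_hat[0])   # cost for a one-hot label on class 0
    others = sum(y_hat[1:])      # probability left for the incorrect classes
    rows.append((y_hat[0], others, loss))

for p_correct, others, loss in rows:
    print(f"correct={p_correct:.4f}  others={others:.4f}  loss={loss:.4f}")
```

As the output for the correct class approaches 1, the probability left over for the other classes shrinks and the loss decreases, even though the loss never looks at those other outputs directly.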

Hi Kampamocha,

As you said, “the other units must be close to 0”, but they are not 0, so I think there is a cost contributed by them.

Hello again @mc04xkf,

That’s right. However, the “cost contributed by them” depends entirely on how you define the cost function, which in this case, being a classification problem, only adds the log of the output for the correct class, zeroing out the “contribution” from incorrect classes.

The intuition I was trying to convey is that, for a classification problem, the cost function focuses on how good you are at approximating the correct class, and by virtue of being good at it, given the nature of the softmax, you end up being good at recognizing the incorrect classes as well.

There are some resources that could be useful.

http://ufldl.stanford.edu/tutorial/supervised/SoftmaxRegression/
https://rstudio-pubs-static.s3.amazonaws.com/337306_79a7966fad184532ab3ad66b322fe96e.html

Hi Kampamocha,

Thanks for replying.
Why was the cost function defined differently in the Machine Learning course, then? Are both definitions correct?

I think I understand now.

The two are for different hypotheses.

In the first one, the hypothesis (softmax) constrains the output units to sum to 1, so the cost of the correct class already contains the information from the incorrect classes.

In the second one, the activation function is sigmoid and each output unit is independent of the others, so we have to add up the cost of each output unit one by one.
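
To check this for myself, here is a minimal sketch in plain Python comparing the two activations on the same logits. The logits are made-up numbers, and the helper names are my own.

```python
import math

def sigmoid(z):
    """Logistic function applied to a single logit."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, -1.0]   # hypothetical pre-activation values
bumped = [5.0, 1.0, -1.0]   # same logits, only the first one is raised

sigmoid_out = [sigmoid(v) for v in logits]
softmax_out = softmax(logits)
sigmoid_bumped = [sigmoid(v) for v in bumped]
softmax_bumped = softmax(bumped)

print(sum(sigmoid_out))   # not constrained to equal 1
print(sum(softmax_out))   # 1 up to floating-point rounding

# Raising the first logit leaves the other sigmoid units untouched,
# but pushes the other softmax units down.
print(sigmoid_out[1], sigmoid_bumped[1])
print(softmax_out[1], softmax_bumped[1])
```

The sigmoid units do not react to each other, so each one needs its own cost term, while the softmax units are tied together through the sum-to-1 constraint.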