Loss for Multiclass Classification

Hello! I hope you are doing well.

I am wondering how we generalize the loss to multiple classes (loss = -log(a_N) for y = N).

From logistic regression:
loss = -log(a1) for y = 1
loss = -log(a2) for y = 0
We got this form by substituting the values of y into the full logistic loss equation (left side of the attached figure).
Following the same approach, if we put y = 3, the loss becomes
loss = -3·log(a1) - (1-3)·log(a2)
loss = -3·log(a1) + 2·log(a2)
I substituted a2 = 1 - a1 and tried to simplify using log properties, but I did not arrive at the generalizable form (loss = -log(a3) for y = 3). Could someone kindly explain this?

Furthermore, we know that loss = -log(a2) for y = 0, but that is the same expression as the loss for y = 2. Kindly clarify this too. I will be thankful to you.

Saif Ur Rehman.

For multiple classes with an NN, there are multiple output units (one per class). The total cost is the sum of the costs of the output units. The y values are converted into a one-hot representation, so each output unit gets its own true/false target.
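As a small illustration (plain Python; the function names and numbers here are my own, not from the course), a sketch of the one-hot conversion and the resulting per-sample cost:

```python
import math

def one_hot(y, num_classes):
    """Convert an integer class label into a one-hot vector."""
    return [1.0 if k == y else 0.0 for k in range(num_classes)]

def sample_loss(a, y):
    """Cross-entropy for one sample: sum over output units of
    -y_k * log(a_k), where y is one-hot. Only the true class's
    term is nonzero, so this reduces to -log(a[y])."""
    target = one_hot(y, len(a))
    return -sum(t * math.log(p) for t, p in zip(target, a))

a = [0.1, 0.7, 0.2]  # softmax outputs for 3 classes (made-up numbers)
y = 1                # true class
print(one_hot(y, 3))      # [0.0, 1.0, 0.0]
print(sample_loss(a, y))  # equals -log(0.7)
```

Note how the sum over output units collapses to the single term -log(a_y), which is exactly the "loss = -log(a_N) for y = N" form being asked about.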

This is discussed further in the 5th video in the Multiclass Classification section.

Hello Saif @saifkhanengr,

We do not derive the log loss from any equation on this slide. There are many ways to motivate the log loss l = -\log{a_n} for y = n, and one of them is maximizing the likelihood of the model parameters given the observed data.

Consider y^{(i)} = c_i for sample i, and let the corresponding model prediction for class c_i be a^{(i)}_{c_i}. Then, for the model trained on the observed data, the likelihood of predicting all of the observations (the training data) is

a^{(1)}_{c_1} \times a^{(2)}_{c_2} \times ... \times a^{(m)}_{c_m}, which is the joint probability, assuming the samples are independent of each other.

A good model maximizes this likelihood or, equivalently, its logarithm, which converts the multiplications into additions:

\log{a^{(1)}_{c_1}} + \log{a^{(2)}_{c_2}} + ... + \log{a^{(m)}_{c_m}}

Maximizing this sum is the same as minimizing its negative, which is how we get the “general form” of the log loss for each sample: -\log{a^{(i)}_{c_i}}.
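A quick numerical check of this equivalence (plain Python, with made-up probabilities standing in for each sample's predicted probability of its true class):

```python
import math

# Predicted probability of the true class for each of m = 3 samples
# (made-up numbers).
probs = [0.9, 0.7, 0.8]

# Joint likelihood: product over samples, assuming independence.
likelihood = math.prod(probs)

# Log-likelihood: the product becomes a sum of logs.
log_likelihood = sum(math.log(p) for p in probs)

# Total loss: sum of the per-sample losses -log(a).
total_loss = sum(-math.log(p) for p in probs)

print(math.isclose(math.log(likelihood), log_likelihood))  # True
print(math.isclose(total_loss, -log_likelihood))           # True
```

So maximizing the likelihood and minimizing the summed per-sample log losses pick out the same model.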

From this form, we can derive the log loss formula for the binary case which is on the L.H.S. of the slide screenshot.
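Spelling that step out in the slide's notation: with two classes we have a_2 = 1 - a_1, so the general per-sample loss -\log{a_c} reduces to

```latex
\ell = \begin{cases}
  -\log{a_1} & \text{if } y = 1 \\
  -\log{(1 - a_1)} & \text{if } y = 0
\end{cases}
\qquad\Longleftrightarrow\qquad
\ell = -y \log{a_1} - (1 - y) \log{(1 - a_1)}
```

which is the binary log loss on the left of the figure.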


Thanks, Raymond, for correcting me.