Softmax Loss Function for single example

In the loss function for softmax, suppose there is a single example pair (X1, y1) with 4 classes as the output:
y = (1, 2, 3, 4)

If y1 = 3, is it the case that we calculate only the loss for y1 = 3, like -\log(a_3), where a_3 is the predicted probability for class 3?

Hello @ajaykumar3456,

I think it depends on the implementation of the loss function. Some implementations might raise an error saying that they expect only one class per sample. TensorFlow, I think, won’t raise an error; instead it will take all four classes into account and calculate \sum_{n=1,2,3,4} -\log(a_n) for that sample, where a_n is the predicted probability of the n-th class.


1 Like

As taught in this course, if there are “N” labels, then there are “N” outputs. All of them contribute to the cost value.

  • N-1 of them will be false (0).
  • one of them will be true (1).

So, when computing the loss function for a single example, it only considers the term where y == j.

The same applies to all the training examples with respect to their corresponding target labels.

Does that make sense?

And I understand that you calculate the loss for that particular training example where y == j, but you use all the z’s in the denominator when computing the loss.
But how will the network know that the 3rd neuron should be improved to output a bigger value that corresponds to the true label?

Hello @ajaykumar3456,

Yes! The textbook requirement is that we set only one label y per sample, and so we pick only the y = n loss term in the calculation of the loss. This is the same for each and every sample. If we somehow place more than one label on one sample (e.g. y = 1, 2, 3, 4), then an implementation that doesn’t regard this as an error will include all four in the calculation of the loss for that sample. However, please use only one label, as Tom pointed out.

Very, very good question! I think you were assuming that the sample has a label of 3 and then asking why we need to optimize the 3rd neuron in particular, while all neurons’ z values are involved in the final loss.

You have a very good sense that whatever contributed to the loss will be considered in the optimization! That’s great!

Now, let’s first get back to the loss function, which is,

J = -\log(a_3) = -\log(\frac{\exp(z_3)}{... + \exp(z_2) +\exp(z_3)+... })

and please allow me to just list out 2 terms in the denominator because I think that’s sufficient to illustrate the idea.

The R.H.S. of the above equation can be rewritten as

-z_3 + \log(... + \exp(z_2) +\exp(z_3)+... )

You might have smelled something, but let’s take the derivative and see what happens:

\frac{\partial{J}}{\partial{z_3}} = \frac{\exp(z_3)}{... + \exp(z_2) +\exp(z_3)+... } -1 = a_3 - 1

\frac{\partial{J}}{\partial{z_2}} = \frac{\exp(z_2)}{... + \exp(z_2) +\exp(z_3)+... } = a_2

By induction, we know what the other gradients look like.
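If you want to convince yourself of these derivatives numerically, here is a quick numpy sketch (the logits and the label below are arbitrary illustration values, not from the course) that compares the analytic gradient a_n - 1\{n = y\} against a finite-difference estimate:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def loss(z, y):
    # cross-entropy loss for one sample: -log(a_y)
    return -np.log(softmax(z)[y])

z = np.array([0.5, -1.2, 2.0, 0.3])  # arbitrary logits for 4 classes
y = 2                                 # true label (zero-based: the "3rd neuron")

a = softmax(z)
analytic = a.copy()
analytic[y] -= 1.0                    # dJ/dz_n = a_n - 1{n == y}

# numerical gradient by central differences
eps = 1e-6
numeric = np.zeros_like(z)
for n in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[n] += eps
    zm[n] -= eps
    numeric[n] = (loss(zp, y) - loss(zm, y)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))  # True
```

The two gradients agree to numerical precision, matching the \frac{\partial{J}}{\partial{z_3}} = a_3 - 1 and \frac{\partial{J}}{\partial{z_2}} = a_2 results above.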

So, what makes the 3rd neuron the center of this gradient descent update is the -1 term. In the descent formula we have z_3 = z_3 - \alpha \times \frac{\partial{J}}{\partial{z_3}}. What the -1 does is increase z_3 (I know we update weights instead of updating z, but the idea remains: it adjusts the relevant weights so that z_3 increases as a result). Then what does the a_n term do? Because it is always positive, it decreases the corresponding z_n!! Therefore, all the other neurons get suppressed and only the 3rd neuron gets incited!

Now, you may ask: but the 3rd neuron also gets the suppressing a_n term! Yes, it does, but a_n is always less than or equal to one, which makes \frac{\partial{J}}{\partial{z_3}} always less than or equal to 0. Therefore, gradient descent always attempts to suppress the other neurons, and always attempts to “encourage” the 3rd neuron, until the 3rd neuron predicts a_3 = 1 and no more improvement is required.
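A toy simulation can make this visible. As a simplification (we update z directly instead of the weights, just to illustrate the direction of the updates; the starting logits and learning rate are made up), repeatedly applying z = z - \alpha (a - onehot(y)) drives the 3rd neuron up and the others down:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # stable softmax
    return e / e.sum()

z = np.array([1.0, 0.5, -0.5, 0.2])  # the "3rd neuron" (index 2) starts lowest
y = 2                                 # true label, zero-based
alpha = 0.5                           # made-up learning rate

for _ in range(200):
    a = softmax(z)
    grad = a.copy()
    grad[y] -= 1.0        # dJ/dz = a - one_hot(y); grad[y] <= 0, others >= 0
    z -= alpha * grad     # z[y] rises, all other z's fall

a = softmax(z)
print(a[y])  # close to 1: the 3rd neuron has been "incited"
```

Even though neuron 3 started with the smallest logit, the -1 term keeps pushing z_3 up while the positive a_n terms push the others down, exactly as described above.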


PS: a_n is the activation value from the n-th neuron. Since this is the output layer, a_n is also the model prediction, i.e. the probability that the sample is class n.


I think the negatives in exp(-z) shouldn’t be there; it should be exp(z). The negatives belong in front of the log instead.


Thank you very much @Basit_Kareem!! I put the sign in the wrong place. I am going to make the corrections.


You are welcome. Mistakes help us get better.

I also did the same while solving my assignment. I thought that was how it should be, since that is how it was with the sigmoid function. The auto-grader flagged my answer as incorrect, so I had to check my notes to see what was wrong, since I was sure I understood the problem.

1 Like

That’s very true, and I am very glad someone pointed my mistakes out. It takes people’s time to read over your work before they can tell you your mistakes. I appreciate it, @Basit_Kareem! :smiley:


1 Like

First of all, thank you for the very deep explanation. I understand you have invested a lot of time in resolving the doubts we learners have, and I really appreciate your support. And sorry for the late response; it took me a bit to understand everything and come up with questions.

  1. From your explanation, what I understood was that if there are 4 classes, then we need to calculate the derivatives for all 4 a’s. I’ve attached the picture below. Does that make sense?
    (Like dJ/da1, dJ/da2, dJ/da3, dJ/da4)
  2. I’m not very good at math. May I know what the other gradients look like? If I’m not wrong, do they sum up to 1?
  3. When I try to expand the derivative for the 3rd neuron term, it shows that only the learning-rate term is being added in the update, which I feel is very small. Does that create enough of an impact to incite the 3rd neuron?
  4. Might be a silly question, but I didn’t understand why the dJ/dz3 term will be less than or equal to 0. Why is that?
  5. Why is the a_n term always positive?
    And my questions are all in the context that, given the initial weights, the a value for the 3rd neuron is not the maximum among the other 3 neurons, which it should be if the true label is y = 3.

Adding on,

Except for the A3 output neuron, the other values A1, A2 and A4 (as there are only 4 classes) will be reduced. Am I right?

The whole idea is to make the Z3 value bigger, which in turn makes the corresponding A3 value much bigger with the help of exponentiation.
Z3 becomes bigger with the help of the update you mentioned.
Even though the denominator term contains all the Z’s, that doesn’t matter. Does that make sense?

Hello @ajaykumar3456,

Thank you. Thank you for the response.

I will hold on going into further details about the maths first.

We had, in one of my previous replies, that

\frac{\partial{J}}{\partial{z_3}} = a_3 - 1
\frac{\partial{J}}{\partial{z_2}} = a_2

The others are,

\frac{\partial{J}}{\partial{z_n}} = a_n where n \ne 3

If you add all \frac{\partial{J}}{\partial{z_n}} up, the sum is not 1; in fact it is 0, because the a_n sum to 1 and there is a -1 in \frac{\partial{J}}{\partial{z_3}}.

I don’t want to go into the maths now.

Again, we have \frac{\partial{J}}{\partial{z_3}} = a_3 - 1.

Remember that a_3 is the probability that the sample is label 3, and any probability value is between 0 and 1, so a_3 - 1 is between -1 and 0.

a_n is the probability that the sample is label n, and any probability value is between 0 and 1.

That’s alright.

The full sentence is: given one and only one sample in the training process, if that sample has a label of 3, then the weights will be adjusted such that the output of the 3rd neuron a_3 will be getting closer to 1 while all the others will get close to 0. I think you are right.

Yes. Note that the denominator is the same for every output in the output layer, so the denominator doesn’t affect the ordering of the neurons. If z_3 is the largest, a_3 stays the largest.
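This "shared denominator" point can be checked in a couple of lines. Since exp is increasing and every a_n divides by the same sum, sorting the outputs a gives the same order as sorting the logits z (the logits below are arbitrary illustration values):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shared denominator: e.sum()
    return e / e.sum()

z = np.array([0.7, -0.3, 2.1, 1.4])  # arbitrary logits
a = softmax(z)

# exp is monotonically increasing and the denominator is common to all
# outputs, so the ranking of a matches the ranking of z
print(np.array_equal(np.argsort(z), np.argsort(a)))  # True
```

So whichever z is largest, the corresponding a is largest; the denominator rescales everything but never reorders the neurons.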

Any other questions to my responses except for the maths?


1 Like

Oh my God! Thank you so much. I couldn’t have asked for a better explanation. I understood every bit of it and more.
I really appreciate your help @rmwkwok

1 Like

OK @ajaykumar3456! I have just finished this, so now comes the maths!

We usually use zero-based indexing.

Sorry to bother you, but one last question: you calculate the loss w.r.t. the 3rd neuron only, as the target label is y = 3, and then take the derivatives for all the z’s. Am I right?

Yes, because J is equal to just ONE of the four there.


The choice depends on the label, so you are right!


In addition to Raymond’s always excellent and detailed explanations, maybe we could summarize the point about why the loss only has one term corresponding to the actual label of the training sample this way:

What that says is that we are ignoring all the wrong parts of the answer: we don’t care how the “wrongness” is distributed among the possible wrong answers. All we care about is how much “rightness” we have in the one label that we actually care about. If there are 4 possible answers, remember that the 4 prediction values all add up to 1. We don’t care whether the three possible wrong answers have equally distributed values (1/3 of whatever is left over from the right answer) or whether all the “wrongness” is concentrated in one wrong answer. It literally doesn’t matter to us: all we care about is the value predicted for the label we actually want and we want the gradients to drive that one value to be as close to 1 as possible, which will as a natural side effect make all the wrong values smaller.
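To illustrate the point numerically (the probability vectors below are made up for illustration): two predictions that assign the same probability to the true label get exactly the same loss, no matter how the leftover “wrongness” is spread among the other classes:

```python
import numpy as np

# two hypothetical predictions for a 4-class sample whose true label is
# class 0; both give the correct class probability 0.4, but distribute
# the remaining 0.6 among the wrong classes differently
a1 = np.array([0.4, 0.2, 0.2, 0.2])  # wrongness spread evenly
a2 = np.array([0.4, 0.6, 0.0, 0.0])  # wrongness concentrated in one class
y = 0

loss1 = -np.log(a1[y])
loss2 = -np.log(a2[y])
print(np.isclose(loss1, loss2))  # True: only a[y] matters to the loss
```

The loss literally never looks at the wrong-class probabilities; it only reads off the predicted value for the true label.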


Thanks @paulinpaloalto