In the loss function for the softmax, suppose there is a single example pair (X1, y1) with 4 classes as the output,
y = (1, 2, 3, 4).
If y1 = 3, is it the case that we calculate the loss only for y1 = 3, i.e. something like -\log of the predicted probability for class 3?
Hello @ajaykumar3456,
I think it depends on the implementation of the loss function. Some implementations might raise an error saying that they expect only one class for each sample. For Tensorflow, I think it won’t raise any error; instead it will take all four classes into account and calculate -\sum_{n=1,2,3,4} \log(y_n) for that sample, where y_n is the predicted probability of the n-th class.
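If it helps, here is a minimal sketch of the textbook single-label case (toy probabilities, zero-based labels, and tf.keras.losses.SparseCategoricalCrossentropy chosen just as one concrete TensorFlow loss): the loss for the sample is simply -\log of the predicted probability of the true class.

```python
import numpy as np
import tensorflow as tf

# toy predicted probabilities for one sample with 4 classes (made-up values)
a = np.array([[0.10, 0.20, 0.60, 0.10]])   # softmax output of the model
y = np.array([2])                           # true label, zero-based index

# textbook single-label loss: only the true class's probability enters
manual_loss = -np.log(a[0, y[0]])

# TensorFlow's sparse categorical cross-entropy on the same probabilities
tf_loss = tf.keras.losses.SparseCategoricalCrossentropy()(y, a).numpy()

print(manual_loss, tf_loss)                 # both ≈ 0.511
```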
Cheers,
Raymond
As taught in this course, if there are “N” labels, then there are “N” outputs. All of them contribute to the cost value.
So, when computing the loss function for a single example, it is going to consider only the term where y == j.
The same goes for all the training examples, with respect to their corresponding target labels.
Does that make sense?
And, I understand that you calculate the loss for that particular training example where y == j, but you use all the z’s in the denominator when calculating the loss.
But how will you know that the 3rd neuron should be improved to output a bigger value that corresponds to the real output?
Hello @ajaykumar3456,
Yes! The textbook requirement is that we set only one label y per sample, and so we pick only the y = n loss term when calculating the loss. This is the same for each and every sample. If we somehow place more than one label on a sample (e.g. y = 1, 2, 3, 4), then an implementation that doesn’t treat this as an error will include all four in the calculation of the loss for that sample. However, please pick only one label, as Tom pointed out.
Very, very good question! I think you were assuming that the sample has a label of 3, and then asking why the optimization focuses on the 3rd neuron while all neurons’ z values are involved in the final loss.
You have a very good sense that whatever contributed to the loss will be considered in the optimization! That’s great!
Now, let’s first get back to the loss function, which is,
J = -\log(a_3) = -\log(\frac{\exp(z_3)}{... + \exp(z_2) +\exp(z_3)+... })
and please allow me to just list out 2 terms in the denominator because I think that’s sufficient to illustrate the idea.
The R.H.S. of the above equation can be rewritten (using \log\frac{x}{y} = \log x - \log y and \log(\exp(z_3)) = z_3) as
-z_3 + \log(... + \exp(z_2) +\exp(z_3)+... )
You might have smelled something, but let’s take the derivative and see what happens:
\frac{\partial{J}}{\partial{z_3}} = \frac{\exp(z_3)}{... + \exp(z_2) +\exp(z_3)+... } -1 = a_3 - 1
\frac{\partial{J}}{\partial{z_2}} = \frac{\exp(z_2)}{... + \exp(z_2) +\exp(z_3)+... } = a_2
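(Spelling out the intermediate step, using only the terms already written above: the \log term is differentiated with the chain rule, and the -z_3 term contributes -1 only when n = 3.)

\frac{\partial}{\partial{z_n}} \log(... + \exp(z_2) +\exp(z_3)+... ) = \frac{\exp(z_n)}{... + \exp(z_2) +\exp(z_3)+... } = a_n, \qquad \frac{\partial}{\partial{z_n}}(-z_3) = -1 \text{ if } n = 3, \text{ and } 0 \text{ otherwise.}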
By the same reasoning, we know what the other gradients look like.
So, what makes the 3rd neuron the center of this gradient descent update is the -1 term. In the descent formula we have z_3 = z_3 - \alpha \times \frac{\partial{J}}{\partial{z_3}}. What the -1 does is increase z_3 (I know we update weights instead of updating z, but the idea remains that it adjusts the relevant weights so that z_3 increases as a result). Then what does the a_n term do? Because it is always positive, it decreases the corresponding z_n!! Therefore, all neurons get suppressed and only the 3rd neuron gets encouraged!
Now, you may ask: but the 3rd neuron also gets the suppressing a_n term! Yes, it does, but a_n is always less than or equal to one, which makes \frac{\partial{J}}{\partial{z_3}} always less than or equal to 0. Therefore, gradient descent always attempts to suppress the other neurons and always attempts to “encourage” the 3rd neuron, until the 3rd neuron predicts a_3 = 1 and no more improvement is required.
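To make this concrete, here is a small numerical sketch (the logits are made-up, and zero-based index 2 stands in for the “3rd neuron”) that checks the analytic gradients a_n - 1 (true class) and a_n (other classes) against finite differences:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())              # subtract max for numerical stability
    return e / e.sum()

z = np.array([1.0, 2.0, 0.5, -1.0])      # toy logits for 4 classes (made-up values)
label = 2                                 # the "3rd neuron", zero-based index

a = softmax(z)
loss = -np.log(a[label])

# analytic gradients from the derivation above: a_n, minus 1 for the true class
grad = a.copy()
grad[label] -= 1.0

# finite-difference check of d(loss)/d(z_n)
eps = 1e-6
num_grad = np.zeros_like(z)
for n in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[n] += eps
    zm[n] -= eps
    num_grad[n] = (-np.log(softmax(zp)[label]) + np.log(softmax(zm)[label])) / (2 * eps)

print(np.allclose(grad, num_grad, atol=1e-5))   # True
```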
Cheers,
Raymond
PS: a_n is the activation value from the n-th neuron. Since it is the output layer, a_n is also the model’s prediction of the probability that the sample is class n.
I think the negatives in exp(-z) shouldn’t be there; it should be exp(z). The negative signs belong in front of the log instead.
Thank you very much @Basit_Kareem!! I put the sign in the wrong place. I am going to make the corrections.
You are welcome. Mistakes help us get better.
I also did the same while solving my assignment. I thought that was how it should be, since that is how it was with the sigmoid function. The auto-grader flagged my answer as incorrect, so I had to check my notes to see what was wrong, since I was sure I understood the problem.
That’s very true, and I am very glad someone pointed my mistakes out. It takes people’s time to read over your work before they can tell you your mistakes. I appreciate it, @Basit_Kareem!
Cheers,
Raymond
First of all, thank you for the very deep explanation. I understand you have invested a lot of time in resolving the doubts we learners get, and I really appreciate your support. And sorry for the late response, as it took me a bit to understand everything and come up with questions.
Adding on,
Except for the A3 output neuron, the other values A1, A2 and A4 (as there are only 4 classes) will be decreasing. Am I right?
The whole idea is to make the Z3 value bigger, which in turn makes the corresponding A3 value much bigger with the help of exponentiation.
Z3 is becoming bigger with the help of the update you mentioned.
Even though the denominator term has all the Z’s, that doesn’t matter. Does that make sense?
Hello @ajaykumar3456,
Thank you. Thank you for the response.
I will hold off on going into further details about the maths for now.
We had in one of my previous replies that
\frac{\partial{J}}{\partial{z_3}} = a_3 - 1
\frac{\partial{J}}{\partial{z_2}} = a_2
The others are,
\frac{\partial{J}}{\partial{z_n}} = a_n where n \ne 3
If you add all the \frac{\partial{J}}{\partial{z_n}} up, the sum is not equal to 1; in fact it is 0, because the a_n sum to 1 and there is a -1 in \frac{\partial{J}}{\partial{z_3}} that cancels it.
I don’t want to go into the maths now.
Again, we have \frac{\partial{J}}{\partial{z_3}} = a_3 - 1.
Remember that a_3 is the probability that the sample is label 3, and any probability value is between 0 and 1, so a_3 - 1 is between -1 and 0.
a_n is the probability that the sample is label n, and any probability value is between 0 and 1.
That’s alright.
The full sentence is: given one and only one sample in the training process, if that sample has a label of 3, then the weights will be adjusted such that the output of the 3rd neuron, a_3, gets closer to 1 while all the others get closer to 0. I think you are right.
Yes. Note that the denominator is the same for every output in the output layer, so the denominator doesn’t affect the ordering of the neurons. If z_3 is the largest, a_3 stays the largest.
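A quick sketch of that point with made-up logits: dividing every \exp(z_n) by the same denominator rescales them all by the same factor, so the ordering, and in particular the argmax, is unchanged.

```python
import numpy as np

z = np.array([1.0, 3.5, 0.2, -0.7])      # toy logits
e = np.exp(z)
a = e / e.sum()                          # every term divided by the same denominator

print(np.argmax(z) == np.argmax(a))      # True: the largest z stays the largest a
```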
Any other questions about my responses, apart from the maths?
Raymond
Oh my God! Thank you so much. I couldn’t have asked for a better explanation. I understood every bit of it and more.
I really appreciate your help @rmwkwok
OK @ajaykumar3456! I have just finished this, so now comes the maths!
We usually use zero-based indexing.
Sorry to bother you, but one last question: you calculate the loss w.r.t. the 3rd neuron only, as the target label is y = 3, and then take the derivatives for all the z’s. Am I right?
Yes, because J is equal to just ONE of the four there.
The choice depends on the label, so you are right!
In addition to Raymond’s always excellent and detailed explanations, maybe we could summarize the point about why the loss only has one term corresponding to the actual label of the training sample this way:
What that says is that we are ignoring all the wrong parts of the answer: we don’t care how the “wrongness” is distributed among the possible wrong answers. All we care about is how much “rightness” we have in the one label that we actually care about. If there are 4 possible answers, remember that the 4 prediction values all add up to 1. We don’t care whether the three possible wrong answers have equally distributed values (1/3 of whatever is left over from the right answer) or whether all the “wrongness” is concentrated in one wrong answer. It literally doesn’t matter to us: all we care about is the value predicted for the label we actually want and we want the gradients to drive that one value to be as close to 1 as possible, which will as a natural side effect make all the wrong values smaller.
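To put toy numbers on that: two predictions that place the same 0.6 on the correct label give exactly the same loss, no matter how the remaining 0.4 of “wrongness” is spread among the wrong answers.

```python
import numpy as np

# two predictions with the same probability (0.6) on the true class (index 0),
# but with the leftover 0.4 spread differently among the wrong classes
p1 = np.array([0.6, 0.4 / 3, 0.4 / 3, 0.4 / 3])   # wrongness spread evenly
p2 = np.array([0.6, 0.4, 0.0, 0.0])               # wrongness concentrated in one class

y = 0                                              # true label
print(-np.log(p1[y]), -np.log(p2[y]))              # identical losses ≈ 0.511
```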
Thanks @paulinpaloalto