Hello @ajaykumar3456,
Yes! The textbook requirement is that we assign only one label y per sample, so only the y = n loss term enters the calculation of the loss. This is the same for each and every sample. If we somehow placed more than one label on a sample (e.g. y = 1, 2, 3, 4), then an implementation that doesn’t treat this as an error would include all four terms in the loss for that sample. However, please use only one label, as Tom pointed out.
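Just to make that concrete, here is a minimal NumPy sketch (the variable names `a` and `y` are my own, not from the course code) showing that, with a single label, only one term of the cross-entropy is picked:

```python
import numpy as np

a = np.array([0.1, 0.2, 0.6, 0.1])   # softmax output of one sample, 4 classes
y = 2                                 # the single label for this sample (0-based index)

loss = -np.log(a[y])                  # only the y-th term enters the loss
print(loss)
```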
Very good question! I think you were assuming that the sample has a label of 3 and then asking why the optimization focuses on the 3rd neuron when every neuron’s z appears in the final loss.
You have a very good sense that whatever contributed to the loss will be considered in the optimization! That’s great!
Now, let’s first get back to the loss function, which is,
J = -\log(a_3) = -\log(\frac{\exp(z_3)}{... + \exp(z_2) +\exp(z_3)+... })
and please allow me to list only two terms of the denominator, because I think that’s sufficient to illustrate the idea.
The R.H.S. of the above equation can be rewritten as
-z_3 + \log(... + \exp(z_2) +\exp(z_3)+... )
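If you want to verify this rewrite numerically, here is a quick sketch (a toy 4-class example with my own variable names, where index 2 plays the role of “the 3rd neuron”):

```python
import numpy as np

z = np.array([0.3, 1.2, 2.0, -0.5])   # logits; suppose the true class is index 2
y = 2

loss_softmax  = -np.log(np.exp(z[y]) / np.exp(z).sum())   # -log(a_3)
loss_rewritten = -z[y] + np.log(np.exp(z).sum())           # -z_3 + log(sum of exp(z_n))

print(loss_softmax, loss_rewritten)   # the two values agree
```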
You might have smelled something already, but let’s take the derivative and see what happens:
\frac{\partial{J}}{\partial{z_3}} = \frac{\exp(z_3)}{... + \exp(z_2) +\exp(z_3)+... } -1 = a_3 - 1
\frac{\partial{J}}{\partial{z_2}} = \frac{\exp(z_2)}{... + \exp(z_2) +\exp(z_3)+... } = a_2
By the same pattern, we know what the other gradients look like: every neuron n other than the 3rd gets \frac{\partial{J}}{\partial{z_n}} = a_n.
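A quick way to convince yourself of these gradients is to compare them against a finite-difference estimate. A small sketch, again with my own names and a toy example:

```python
import numpy as np

def loss(z, y):
    # softmax cross-entropy for a single sample with label y
    return -np.log(np.exp(z[y]) / np.exp(z).sum())

z = np.array([0.3, 1.2, 2.0, -0.5])
y = 2

a = np.exp(z) / np.exp(z).sum()
analytic = a.copy()
analytic[y] -= 1.0                      # dJ/dz_n = a_n, except a_3 - 1 for the labelled neuron

eps = 1e-6
numeric = np.array([
    (loss(z + eps * np.eye(len(z))[n], y) - loss(z - eps * np.eye(len(z))[n], y)) / (2 * eps)
    for n in range(len(z))
])

print(np.allclose(analytic, numeric))   # True
```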
So, what makes the 3rd neuron the center of this gradient descent update is the -1 term. In the descent formula we have z_3 = z_3 - \alpha \times \frac{\partial{J}}{\partial{z_3}}. What the -1 does is increase z_3 (I know we update the weights rather than z itself, but the idea remains: it adjusts the relevant weights so that z_3 increases as a result). Then what does the a_n term do? Because it is always positive, it decreases the corresponding z_n! Therefore, all neurons get suppressed and only the 3rd neuron gets encouraged!
Now, you may ask: but the 3rd neuron also gets the suppressing a_n term! Yes, it does, but a_3 is always less than or equal to one, which makes \frac{\partial{J}}{\partial{z_3}} = a_3 - 1 always less than or equal to 0. Therefore, gradient descent always attempts to suppress the other neurons and always attempts to “encourage” the 3rd neuron, until the 3rd neuron predicts a_3 = 1, at which point no more improvement is required.
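If you run gradient descent directly on z (just as a toy, since in practice we update the weights), you can watch exactly this behaviour: z_3 keeps rising, the other z_n keep falling, and a_3 approaches 1. A minimal sketch, assuming the same toy logits and label index as above:

```python
import numpy as np

z = np.array([0.3, 1.2, 2.0, -0.5])
y = 2                                   # "the 3rd neuron"
alpha = 0.5

for step in range(200):
    a = np.exp(z) / np.exp(z).sum()     # softmax activations
    grad = a.copy()
    grad[y] -= 1.0                      # a_n for n != 3, a_3 - 1 for the labelled neuron
    z -= alpha * grad                   # z_3 increases, all other z_n decrease

a = np.exp(z) / np.exp(z).sum()
print(a[y])                             # close to 1 after enough steps
```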
Cheers,
Raymond
PS: a_n is the activation value of the n-th neuron. Since this is the output layer, a_n is also the model’s prediction, i.e. the probability that the sample belongs to class n.