Why do we use softmax at the last layer? Our prediction will simply be the class whose corresponding Z value was maximum (before softmax is applied). One reason I can think of is that it helps with backpropagation, since softmax has a well-defined derivative. Is there anything else I am missing?

The softmax function is used as the activation function in the output layer of neural network models that predict a multinomial probability distribution. That is, softmax is used as the activation function for multi-class classification problems, where each input must be assigned to one of more than two class labels.
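As a minimal numpy sketch of what softmax does (the function names and example values here are just for illustration): it maps the raw Z values to a probability distribution while preserving their ordering, so the argmax is unchanged.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; this does not change the output
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([5.0, 4.0, 1.0, 2.0, 3.0])
p = softmax(z)

print(p.sum())                        # the outputs sum to 1, like probabilities
print(np.argmax(p) == np.argmax(z))   # softmax is monotonic: argmax is preserved
```

Because exponentiation is monotonic, the class with the largest Z always gets the largest softmax probability.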

Thanks for the answer. But let's say we have to predict the output class in a multi-class classification problem: why can't we simply take the highest Z value (like 5 in [5, 4, 1, 2, 3]) and predict our output with that? Softmax will predict the class with the maximum Z value anyway, so why do the extra step of applying softmax?

You can think of *softmax* as the multi-class generalization of *sigmoid*. You’re right that you could simply select the highest Z value as the prediction, but then how do you compute a loss function with that method? If you use *softmax* which “normalizes” the prediction outputs to look like a probability distribution, you can then use the cross entropy loss function and the mathematical properties are identical to what you get in the binary classification case with *sigmoid* and cross entropy loss.
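To make that concrete, here is a small numpy sketch (the label in `y` is a made-up example) showing why the softmax step matters for training: the cross-entropy loss needs a probability distribution, and with softmax the gradient with respect to the logits takes the simple, well-known form `p - y`, whereas a bare argmax has no useful derivative.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a single logit vector
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([5.0, 4.0, 1.0, 2.0, 3.0])  # raw logits
y = np.array([0.0, 1.0, 0.0, 0.0, 0.0])  # one-hot true label (hypothetical)

p = softmax(z)
loss = -np.sum(y * np.log(p))  # cross-entropy loss
grad_z = p - y                 # gradient of CE w.r.t. the logits

print(loss)     # positive, and larger the less probability the true class gets
print(grad_z)   # negative at the true class, positive elsewhere
```

Note that `grad_z` sums to zero and pushes the true class's logit up and the others down; this clean gradient is exactly what backpropagation uses, and it mirrors the sigmoid + binary cross-entropy case.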

In addition to the material that Prof Ng provides here about softmax, you might also find it useful to watch Prof Geoff Hinton’s lecture about softmax and cross entropy loss. You can find it on YouTube here.

Thanks a lot for the detailed answer.