I have a question about multiclass classification problems: usually we put softmax as the last layer, with the equation e^a/sum(e^a).

My questions are: what are the advantages of softmax, and would an equation like a^2/sum(a^2) most likely work as well? Here a stands for the output from the last layer's activation.

Yes, softmax is the preferred activation function for the output layer of a network that is doing “multiclass” classification, that is to say classification in which there are multiple possible answers, not just “yes/no”. What softmax does is convert the output values to something that can be thought of as the probability of each of the possible answers for a given input sample. It turns out that you can think of softmax as the multiclass generalization of sigmoid, and the “cross entropy” loss function also works for softmax. The mathematical behavior of the losses and gradients is the same in both cases.
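
To make that concrete, here is a minimal NumPy sketch (not course code; the function names and sample values are just made up for illustration). It shows that the softmax outputs are positive and sum to 1, and that with two classes, where one input is fixed at 0, softmax gives the same value as sigmoid.

```python
import numpy as np

def softmax(a):
    # Subtracting the max avoids overflow in exp and does not change the result.
    shifted = a - np.max(a, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical last-layer values for a 3-class problem.
a = np.array([2.0, 1.0, -1.0])
p = softmax(a)
print("softmax output:", p)
print("sum of outputs:", p.sum())

# Two-class case: softmax over [z, 0] reproduces sigmoid(z) for the first class.
z = 0.7
two_class = softmax(np.array([z, 0.0]))
print("softmax([z, 0])[0]:", two_class[0])
print("sigmoid(z):        ", sigmoid(z))
```

The largest input always gets the largest probability, so picking the argmax of the softmax output is the same as picking the argmax of the inputs.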

Prof Ng will cover softmax in Course 2 of this series, so please stay tuned for that.

I am not familiar with the other function you suggest. You can try some experiments using that and see how it works. Of course you’ll need to pick a loss function as well, but since the values are between 0 and 1, you could try “cross entropy” loss for that.
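
If you want a starting point for that experiment, here is a small sketch comparing the two functions on the same inputs. Everything in it is an assumption for illustration: `squared_normalize` is just a name I made up for your a^2/sum(a^2), and the input values are arbitrary. One thing you can already see at this scale is that squaring discards the sign of a, so 2 and -2 get the same output, while softmax keeps their order.

```python
import numpy as np

def softmax(a):
    exp = np.exp(a - np.max(a))
    return exp / exp.sum()

def squared_normalize(a):
    # The a^2/sum(a^2) function from the question (hypothetical name).
    # Note: this divides by zero if every entry of a is 0.
    sq = a ** 2
    return sq / sq.sum()

def cross_entropy(p, true_class):
    # Loss for one sample: negative log of the value assigned to the true class.
    return -np.log(p[true_class])

# Arbitrary last-layer values, including a negative one.
a = np.array([2.0, -2.0, 0.5])
p_soft = softmax(a)
p_sq = squared_normalize(a)

print("softmax:", p_soft)
print("squared:", p_sq)
print("loss if class 0 is correct (softmax):", cross_entropy(p_soft, 0))
print("loss if class 0 is correct (squared):", cross_entropy(p_sq, 0))
```

Also keep in mind that the squared version outputs exactly 0 whenever an input is 0, and the log of 0 in the cross entropy loss is infinite, so you would need to handle that case when training.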