Why softmax is used

For multi-class classification, softmax is often used. My question is: why is softmax defined with the formula e^a / sum(e^a), and would something like a^2 / sum(a^2) also work for multi-class classification? Just curious.

That is an interesting idea, but think about the advantages the exponential has over the square:
The exponential is strictly positive and strictly increasing, so it maps ]-oo; +oo[ onto ]0; +oo[ monotonically,
whereas
the square maps ]-oo; +oo[ onto [0; +oo[, zero included, and has both a decreasing and an increasing region.

So imagine that, in your last layer before the softmax activation, you want a higher value to always translate into a higher probability. The square breaks this: because of the sign, a large negative output receives the same probability as a large positive one, which will be troublesome.
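To make the difference concrete, here is a minimal sketch comparing the two normalizations (the function names `softmax` and `square_norm` are mine, not from the thread):

```python
import math

def softmax(xs):
    # Exponentiate each score, then normalize so the outputs sum to 1.
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def square_norm(xs):
    # Hypothetical "square softmax": a_i^2 / sum(a_j^2).
    sqs = [x * x for x in xs]
    total = sum(sqs)
    return [s / total for s in sqs]

# The square destroys the sign: -2 and 2 receive the same probability.
print(softmax([2.0, -2.0]))      # first class clearly dominant
print(square_norm([2.0, -2.0]))  # [0.5, 0.5] — sign information lost
```

With the exponential, the ordering of the logits is always preserved in the probabilities; with the square, logits of opposite sign but equal magnitude become indistinguishable.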

Of course, you may have other reasons to prefer softmax as it is.

I hope that helps!

Have a nice day


Dear sir,
I roughly get the idea. So basically, softmax uses the exponential because it is monotonically increasing, while the square function is not.

The reason is that you want only a high positive value from the last layer to translate into a high probability, whereas the square function would spread the probability onto both high positive and high negative values from the last layer, right? Thanks for the explanation.

Maybe a simple example will make things clear.
Imagine an output before softmaxing:

  • case 1: (1, 0.5) (even though for binary classification you would normally use a single value)
  • case 2: (-1, 0.5)

With softmax, case 1 results in class 0, and case 2 results in class 1.
But with the square softmax-like function, case 2 gives exactly the same output as case 1, so both are classified as class 0: squaring maps -1 and 1 to the same value.
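The two cases can be checked numerically. This is a small sketch under the same assumptions as above (the helper names are mine):

```python
import math

def softmax(xs):
    # Standard softmax: e^a_i / sum(e^a_j).
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def square_norm(xs):
    # Hypothetical square-based alternative: a_i^2 / sum(a_j^2).
    sqs = [x * x for x in xs]
    total = sum(sqs)
    return [s / total for s in sqs]

def pred(probs):
    # Index of the most probable class.
    return probs.index(max(probs))

case1 = (1.0, 0.5)
case2 = (-1.0, 0.5)

print(pred(softmax(case1)), pred(softmax(case2)))          # 0 1
print(pred(square_norm(case1)), pred(square_norm(case2)))  # 0 0 — cannot tell -1 from 1
```

Softmax separates the two cases as the sign suggests, while the square version gives identical probabilities for both.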

Of course, your model could adapt to this choice, but it seems simpler for an output (a, b) before activation to map to the class with the highest value, without any sign issues.