Why softmax is used

For multi-class classification, softmax is often used. My question is: why is softmax defined with the formula e^a / sum(e^a), and would something like a^2 / sum(a^2) also work for multi-class classification? Just curious.

That is an interesting idea, but think about the advantages the exponential has over the square:
The exponential is strictly positive and strictly increasing, so it maps ]-oo; +oo[ onto ]0; +oo[ monotonically,
whereas
the square maps ]-oo; +oo[ onto [0; +oo[, zero included, and has both a decreasing and an increasing region.

So imagine that, in your last layer before the softmax activation, you want a higher value to always translate into a higher probability. The square breaks this: because of the sign, a large negative output receives the same probability as a large positive one, which will be troublesome.
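To make the difference concrete, here is a minimal sketch comparing the two normalizations (the function names `softmax` and `square_norm` are mine, not from the thread):

```python
import math

def softmax(xs):
    # Exponentiate each score, then normalize so the outputs sum to 1.
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def square_norm(xs):
    # Hypothetical "square softmax": a_i^2 / sum(a_j^2).
    sqs = [x * x for x in xs]
    total = sum(sqs)
    return [s / total for s in sqs]

# The square destroys the sign: -2 and 2 receive the same probability.
print(softmax([2.0, -2.0]))      # first class clearly dominant
print(square_norm([2.0, -2.0]))  # [0.5, 0.5] — sign information lost
```

With the exponential, the ordering of the logits is always preserved in the probabilities; with the square, logits of opposite sign but equal magnitude become indistinguishable.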

Of course, you may have other reasons to prefer softmax as it is.

I hope that helps!

Have a nice day


Dear sir,
I roughly get the idea. So basically, softmax uses the exponential because it is monotonically increasing, while the square function is not.

The reason is that you want only a high positive value from the last layer to translate into a high probability, whereas the square function would spread the probability onto both high positive and high negative values from the last layer, right? Thanks for the explanation.

Maybe a simple example will make things clear.
Imagine an output before softmaxing:

  • case 1: (1, 0.5) (even though for binary classification you would normally use a single value)
  • case 2: (-1, 0.5)

With softmax, case 1 results in class 0, and case 2 results in class 1.
But with the square softmax-like function, case 2 gives exactly the same output as case 1, so both are classified as class 0: squaring maps -1 and 1 to the same value.
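The two cases can be checked numerically. This is a small sketch under the same assumptions as above (the helper names are mine):

```python
import math

def softmax(xs):
    # Standard softmax: e^a_i / sum(e^a_j).
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def square_norm(xs):
    # Hypothetical square-based alternative: a_i^2 / sum(a_j^2).
    sqs = [x * x for x in xs]
    total = sum(sqs)
    return [s / total for s in sqs]

def pred(probs):
    # Index of the most probable class.
    return probs.index(max(probs))

case1 = (1.0, 0.5)
case2 = (-1.0, 0.5)

print(pred(softmax(case1)), pred(softmax(case2)))          # 0 1
print(pred(square_norm(case1)), pred(square_norm(case2)))  # 0 0 — cannot tell -1 from 1
```

Softmax separates the two cases as the sign suggests, while the square version gives identical probabilities for both.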

Of course, your model could adapt to this choice, but it seems simpler for an output (a, b) before activation to map to the class with the highest value, without any sign issues.