Softmax output layer vs. k sigmoid units in the output layer

Why can't we just use k sigmoid units in the output layer for k-class classification and predict the class as the argmax of those k outputs?
In this case we could train the neural network on one-hot targets, e.g. y = [1,0,0,0,0,0,0,0,0,0] for class 1 and y = [0,0,0,0,1,0,0,0,0,0] for class 5, and so on for 10-class classification.
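The one-hot encoding described above can be sketched in a few lines (a NumPy sketch; the `one_hot` helper name and zero-indexed classes are illustrative, not from the original post):

```python
import numpy as np

def one_hot(label, num_classes=10):
    """Return a one-hot target vector for an integer class label (0-indexed)."""
    y = np.zeros(num_classes)
    y[label] = 1.0
    return y

print(one_hot(0))  # class 0 -> [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(one_hot(4))  # class 4 -> [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
```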

Does a softmax layer give better results than k sigmoid output units for k-class classification?

Hi @sandeep_kumar13,

You could use k sigmoids instead of softmax, but softmax is usually the better fit for single-label (mutually exclusive) classification. Firstly, softmax produces a proper probability distribution: its outputs are positive and sum to 1 across all classes. K sigmoids produce k independent probabilities that need not sum to 1, which makes them harder to interpret as "the probability of class i" — independent sigmoids are really a multi-label formulation. Secondly, softmax pairs naturally with categorical cross-entropy, giving a loss that is smooth and convex in the logits with simple, well-behaved gradients. With k sigmoids, each unit is effectively trained as a separate one-vs-rest binary problem, so nothing in the loss forces the classes to compete with each other.
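The normalization difference is easy to see numerically (a minimal NumPy sketch; the 3-class logit values are made up for illustration):

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([2.0, 1.0, 0.1])

p_soft = softmax(logits)       # one distribution over the 3 classes
p_sig = sigmoid(logits)        # 3 independent per-class probabilities

print(p_soft.sum())            # exactly 1: a proper distribution
print(p_sig.sum())             # about 2.14 here: not a distribution
```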

Thirdly, note that the two layers have the same number of parameters (k units, each with its own weights and bias); the difference is in how the outputs interact. Softmax couples the outputs: raising the score of one class necessarily lowers the probabilities of the others, which matches the mutually exclusive nature of k-class classification. With independent sigmoids, the network can happily assign high probability to several classes at once, and each unit's gradient ignores the other classes.
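One concrete property that makes the softmax + cross-entropy pairing pleasant to optimize: the gradient of the loss with respect to the logits simplifies to p − y (predicted probabilities minus the one-hot target). A quick numerical check of that identity (a sketch; the 5-class random logits are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())    # stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
z = rng.normal(size=5)         # random logits for a 5-class example
y = np.zeros(5)
y[2] = 1.0                     # one-hot target: true class is 2

def loss(z):
    return -np.log(softmax(z)[2])   # cross-entropy with the one-hot target

analytic = softmax(z) - y      # the claimed gradient: p - y

# Central-difference numerical gradient for comparison
eps = 1e-6
numeric = np.array([
    (loss(z + eps * np.eye(5)[i]) - loss(z - eps * np.eye(5)[i])) / (2 * eps)
    for i in range(5)
])

print(np.allclose(analytic, numeric, atol=1e-5))  # True
```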

Overall, while k sigmoids are a workable option (and are the right choice for multi-label problems, where classes are not mutually exclusive), softmax is usually preferred for single-label k-class classification: it yields a normalized probability distribution, pairs with cross-entropy for clean gradients, and is the default that most frameworks and practitioners reach for.


Hi @Mujassim_Jamal,
Thank you very much for such a wonderful explanation.