In one of Prof. Ng’s previous courses (Machine Learning by Stanford), the sigmoid is used in a deep NN for classifying handwritten digits 0-9, i.e. the sigmoid is used for a multiclass case. In this current course, I understand that we use softmax for multiple classes. Does softmax outperform sigmoid for multiple classes?

How would you apply sigmoid to a multi-class classification? Because it only gives yes/no answers, right? If you recall how Prof. Ng did that in the original Machine Learning course, it was to use the “one vs all” approach. So if you have 10 classes, as in your example, what you do is run the training literally 10 times: once for 0 vs all the others, once for 1 vs all the others, and so forth. You thus end up with 10 separate models. To predict the value of a given input, you run all 10 models and then select the class for which the corresponding model has the highest output.
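To make the “one vs all” procedure concrete, here is a minimal sketch in NumPy. It is only an illustration, not the code from either course: the helper names (`train_binary`, `train_one_vs_all`, `predict_one_vs_all`), the learning rate, the iteration count, and the toy 3-class data are all my own assumptions, and plain logistic regression stands in for whatever model you would really use.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_binary(X, y, lr=0.1, iters=500):
    """Logistic regression by batch gradient descent; y holds 0/1 labels."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(iters):
        a = sigmoid(X @ w + b)
        grad = a - y                  # dLoss/dz for sigmoid + cross-entropy
        w -= lr * (X.T @ grad) / m
        b -= lr * grad.mean()
    return w, b

def train_one_vs_all(X, labels, num_classes):
    # One complete training run per class: class k vs. all the others.
    return [train_binary(X, (labels == k).astype(float))
            for k in range(num_classes)]

def predict_one_vs_all(models, X):
    # Run every model, then pick the class whose model has the highest output.
    scores = np.column_stack([sigmoid(X @ w + b) for w, b in models])
    return scores.argmax(axis=1)

# Hypothetical toy data: three well-separated 2-D clusters, 30 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 5.0], [5.0, -5.0], [-5.0, -5.0]])
labels = np.repeat(np.arange(3), 30)
X = centers[labels] + rng.normal(scale=0.5, size=(90, 2))

models = train_one_vs_all(X, labels, num_classes=3)
preds = predict_one_vs_all(models, X)
accuracy = (preds == labels).mean()
print(f"training accuracy: {accuracy:.2f}")
```

Note that `train_one_vs_all` calls the training loop once per class, which is exactly where the extra cost comes from.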

I have not tried a real experiment to compare the results of sigmoid with “one vs all” versus softmax on a particular problem, so I don’t know whether there is a difference in the accuracy of the resulting models. But the one thing we can say for sure is that the cost of training the model is significantly higher in the “one vs all” case: we have to run the complete training 10 times (or whatever the number of classes is) versus once. Of course there may be more subtleties there (e.g., maybe you need fewer iterations in each case for “one vs all”, but it could just as easily be more iterations), but the overall point is that “one vs all” sounds a lot more expensive. Once you have softmax and understand how to use it, it makes everything a lot more straightforward.

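One more point worth mentioning is that the mathematics of softmax and sigmoid are very closely related. You’ll notice that the cross-entropy loss has the same form in both cases, and so does its derivative with respect to the input of the activation: the prediction minus the label. With only two classes, softmax reduces to the sigmoid of the difference between the two inputs. You can think of softmax as the multiclass generalization of sigmoid.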
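If you want to convince yourself of that relationship numerically, here is a small NumPy check. It is a sketch under my own assumptions: the input values are arbitrary, and the function and variable names are just for illustration. It verifies that two-class softmax equals the sigmoid of the difference of the two inputs, and that the gradient of the cross-entropy loss with respect to the softmax inputs is the prediction minus the one-hot label, the same form you get for sigmoid.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(z, y):
    # y is a one-hot label vector, z are the inputs to softmax
    return -np.sum(y * np.log(softmax(z)))

# 1) Two classes: softmax output for class 0 equals sigmoid(z0 - z1).
z = np.array([2.0, -1.0])
two_class_matches = np.isclose(softmax(z)[0], sigmoid(z[0] - z[1]))

# 2) Gradient of the loss w.r.t. z is (softmax(z) - y), checked by
#    central finite differences on an arbitrary 3-class example.
z3 = np.array([0.5, -1.2, 2.0])
y3 = np.array([0.0, 1.0, 0.0])
eps = 1e-6
numeric = np.array([
    (cross_entropy(z3 + eps * np.eye(3)[i], y3)
     - cross_entropy(z3 - eps * np.eye(3)[i], y3)) / (2 * eps)
    for i in range(3)
])
analytic = softmax(z3) - y3
grad_matches = np.allclose(numeric, analytic, atol=1e-6)

print(two_class_matches, grad_matches)
```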