Why is the sigmoid activation function better for binary classification than the tanh activation function?

Sigmoid gives a value between 0 and 1, and we can use a threshold of 0.5 to round the output to 0 or 1, which makes a good binary classifier.

But isn’t that also the case with tanh, which gives a value between -1 and 1?
We can use 0 as the threshold to round the output to -1 or 1.

One of the answer justifications in the Week 3 quiz says that tanh is less convenient because its output is between -1 and 1. I don’t understand how?!

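What cost function would you use if tanh is your output activation? The cross-entropy (log loss) cost function that we use only handles outputs in the range 0 to 1: it takes the log of the prediction and of one minus the prediction, and the log is undefined for the negative values that tanh can produce.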
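To make that concrete, here is a minimal sketch in plain Python, assuming a single training example. The helper names `sigmoid` and `binary_cross_entropy` and the sample value `z = -0.5` are made up for illustration and are not from the course code.

```python
import math

def sigmoid(z):
    # Squashes any real z into the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y, a):
    # Log loss for one example: label y in {0, 1}, prediction a.
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

z = -0.5  # hypothetical pre-activation value of the output unit
y = 1     # true label

# Sigmoid output lies in (0, 1), so both logs are defined.
loss_sigmoid = binary_cross_entropy(y, sigmoid(z))

# tanh(-0.5) is negative, so math.log raises ValueError.
try:
    loss_tanh = binary_cross_entropy(y, math.tanh(z))
except ValueError:
    loss_tanh = None
```

With the sigmoid output the loss is an ordinary positive number. With the tanh output the computation fails as soon as the prediction is negative, so you would need a different cost function altogether.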

If your response is, "well, we could shift and scale tanh to have the range (0, 1)," then guess what? tanh and sigmoid are very closely related mathematically: a tanh shifted and scaled that way is just a sigmoid with its input doubled, so you don’t really gain any advantage from that strategy.
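The relationship is tanh(z) = 2 * sigmoid(2z) - 1, which rearranges to (tanh(z) + 1) / 2 = sigmoid(2z). A quick numerical check, as a sketch only: the helper names and the sample points in `zs` are arbitrary choices for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def shifted_tanh(z):
    # tanh rescaled from the range (-1, 1) to (0, 1).
    return (math.tanh(z) + 1.0) / 2.0

zs = [-3.0, -1.0, 0.0, 0.5, 2.0]  # arbitrary sample inputs

# Largest difference between the rescaled tanh and sigmoid(2z)
# over the sample points; it should be at floating-point noise level.
max_gap = max(abs(shifted_tanh(z) - sigmoid(2.0 * z)) for z in zs)
```

Since the rescaled tanh is the same function as a sigmoid applied to 2z, the network can absorb the factor of 2 into its weights, and you end up with an ordinary sigmoid output unit.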


Thanks for the detailed answer!