Why is the sigmoid activation function better for binary classification than the tanh activation function?

What cost function would you use if tanh is your output activation? The cross-entropy log loss we normally use can't handle outputs outside the range 0 to 1: it takes the log of the prediction and of one minus the prediction, so it only makes sense when the output can be read as a probability.
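As a quick sanity check, here is a minimal sketch (using only the standard library; `binary_cross_entropy` is a hypothetical helper written for this illustration) showing that the log loss is well defined for a sigmoid-style output in (0, 1) but blows up on a negative tanh-style output:

```python
import math

def binary_cross_entropy(y_true, y_pred):
    """Cross-entropy log loss for a single example: -[y*log(p) + (1-y)*log(1-p)]."""
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))

# A sigmoid output is a valid probability, so the loss is well defined.
print(binary_cross_entropy(1, 0.8))  # -log(0.8), a finite positive loss

# A tanh output can be negative, and the log of a negative number is undefined.
try:
    binary_cross_entropy(1, -0.5)
except ValueError as e:
    print("tanh-style output breaks the loss:", e)
```

The `ValueError` from `math.log` is the concrete symptom: the loss simply isn't defined unless the output lives in (0, 1).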

If your response is, "well, we could shift and scale tanh to have the range (0, 1)," then guess what? It turns out tanh and sigmoid are very closely related mathematically: tanh(x) = 2·sigmoid(2x) − 1, so the shifted-and-scaled tanh is just a sigmoid with a rescaled input, and you don't really gain any advantage from that strategy.
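You can verify this equivalence numerically; this sketch (standard library only, with `scaled_tanh` defined here just for the comparison) checks that (tanh(x) + 1) / 2 equals sigmoid(2x) at several points:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def scaled_tanh(x):
    # tanh shifted and scaled into the range (0, 1)
    return (math.tanh(x) + 1.0) / 2.0

# (tanh(x) + 1) / 2 == sigmoid(2x) for every input:
for x in [-3.0, -0.5, 0.0, 1.2, 4.0]:
    assert abs(scaled_tanh(x) - sigmoid(2 * x)) < 1e-12
print("scaled tanh is just sigmoid with a rescaled input")
```

Since the input scale is something the preceding layer's weights can absorb during training, the rescaled network is equivalent to one that used sigmoid in the first place.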