In one of the earlier versions of this course, I remember a student asking why we specifically chose *sigmoid* as the output layer activation rather than *tanh*. Since the range of *tanh* is (-1,1) instead of (0,1), we could define “yes” answers to be \hat{y} \geq 0. The question then becomes what loss function we would use, since the log loss (“cross-entropy”) depends on the output values lying in the range (0,1). Well, one solution would be to shift and scale the output of *tanh* so that its range becomes (0,1):
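To see why the range matters, here is a minimal sketch of the log loss in Python (the function name `log_loss` is mine, not from the course): both log terms blow up unless the prediction stays strictly inside (0,1).

```python
import math

def log_loss(y_true, y_pred):
    # Binary cross-entropy: only defined for y_pred in (0, 1),
    # since both log(y_pred) and log(1 - y_pred) must exist.
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))
```

A prediction near the correct label gives a small loss, and a confident wrong prediction gives a large one, e.g. `log_loss(1, 0.9)` is much smaller than `log_loss(1, 0.1)`. A raw *tanh* output of -0.5 would put `math.log` out of its domain entirely.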

g(z) = \displaystyle \frac {tanh(z) + 1}{2}

If you do that, then the slope of g(z) is steeper than that of *sigmoid*: at z = 0 it is exactly twice as steep (g'(0) = 1/2 versus 1/4). Here are the two functions graphed over the domain [-5,5], with *sigmoid* in blue and g(z) in orange:
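If a plot isn't handy, a quick numerical check (a sketch in Python; the helper names are mine) confirms both the steeper slope at the origin and the more aggressive plateau at the tails:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def g(z):
    # Shifted-and-scaled tanh: maps the range (-1, 1) onto (0, 1)
    return (math.tanh(z) + 1) / 2

# Central-difference estimate of the slope at z = 0
h = 1e-6
slope_sigmoid = (sigmoid(h) - sigmoid(-h)) / (2 * h)   # about 0.25
slope_g = (g(h) - g(-h)) / (2 * h)                     # about 0.50
```

At z = 5 the gap to the asymptote is about 6.7e-3 for *sigmoid* but only about 4.5e-5 for g(z), which is the “plateaus more aggressively” behavior in numbers.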

So the new g(z) will have a worse version of the “vanishing gradient” problem than *sigmoid*, because it plateaus more aggressively. Well, we can solve that problem by scaling the input value like this:

g(z) = \displaystyle \frac {tanh(\frac{z}{2}) + 1}{2}

But then, guess what? That function is exactly the same as *sigmoid*. Here’s the derivation:

tanh(z) = \displaystyle\frac{e^z - e^{-z}}{e^z + e^{-z}}

Multiplying the numerator and denominator by e^z gives an equivalent form:

tanh(z) = \displaystyle\frac{e^{2z} - 1}{e^{2z} + 1}

g(z) = \displaystyle\frac{1}{2}\left(tanh\left(\frac {z}{2}\right) + 1\right)

g(z) = \displaystyle\frac{1}{2}\left(\frac{e^{z} - 1}{e^{z} + 1} + 1\right)

g(z) = \displaystyle\frac{1}{2}\left(\frac{e^{z} - 1}{e^{z} + 1} + \frac{e^{z} + 1}{e^{z} + 1}\right)

g(z) = \displaystyle\frac{1}{2}\left(\frac{2e^{z} }{e^{z} + 1}\right)

g(z) = \displaystyle\left(\frac{e^{z} }{e^{z} + 1}\right)\left(\frac{e^{-z}}{e^{-z}}\right)

g(z) = \displaystyle\frac{1}{1 + e^{-z}}
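The algebra above can also be checked numerically; in this short Python sketch the two definitions agree to floating-point precision at every test point:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def g(z):
    # Input scaled by 1/2, output shifted and scaled into (0, 1)
    return (math.tanh(z / 2) + 1) / 2

# The two functions are identical up to floating-point error
for z in [-5.0, -1.0, 0.0, 0.5, 3.0]:
    assert abs(g(z) - sigmoid(z)) < 1e-12
```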

So we’re back where we started: let’s just use *sigmoid* and be happy!