In one of the earlier versions of this course, I remember a student asking why we specifically chose sigmoid as the output-layer activation and whether tanh could be used for that purpose instead. Since the range of tanh is (-1, 1) instead of (0, 1), we could define “yes” answers to be \hat{y} \geq 0. The question then becomes what to use as the loss function, since the log loss (‘cross entropy’) depends on the output values lying in the range (0, 1). Well, one solution would be to shift and scale the output of tanh so that its range becomes (0, 1) by using:
g(z) = \displaystyle \frac {tanh(z) + 1}{2}
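For reference, the log loss for a single example with label y \in \{0, 1\} is:

L(\hat{y}, y) = \displaystyle -\left(y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})\right)

Both logarithms are only defined when \hat{y} is strictly between 0 and 1, which is why the output activation needs that range.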
If you use this new g(z), its slope is steeper than sigmoid’s (exactly twice as steep at z = 0). Here are the two functions graphed over the domain [-5, 5], with sigmoid in blue and g(z) shown in orange:
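If you want to reproduce a similar plot yourself, here’s a minimal sketch using numpy and matplotlib (illustrative only, not the code behind the figure above):

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def g(z):
    # Shifted and scaled tanh: maps the range (-1, 1) onto (0, 1)
    return (np.tanh(z) + 1.0) / 2.0

z = np.linspace(-5, 5, 500)
plt.plot(z, sigmoid(z), color="blue", label="sigmoid(z)")
plt.plot(z, g(z), color="orange", label="(tanh(z) + 1) / 2")
plt.xlabel("z")
plt.legend()
plt.show()
```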
So the new g(z) will have a worse version of the “vanishing gradient” problem than sigmoid, because it plateaus more aggressively. Well, we can solve that problem by scaling the input value like this:
g(z) = \displaystyle \frac {tanh(\frac{z}{2}) + 1}{2}
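To see why halving the input fixes the steepness, compare slopes at z = 0. The derivative of tanh is 1 - tanh^2(z), so the first version has slope

\displaystyle \frac{d}{dz}\left(\frac{tanh(z) + 1}{2}\right)\Bigg|_{z=0} = \frac{1 - tanh^2(0)}{2} = \frac{1}{2}

which is twice sigmoid’s slope of \frac{1}{4} at 0. The \frac{z}{2} inside tanh contributes an extra factor of \frac{1}{2} through the chain rule, bringing the slope at 0 back down to \frac{1}{4}.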
But then, guess what? That function is exactly the same as sigmoid. Here’s the derivation:
tanh(z) = \displaystyle\frac{e^z - e^{-z}}{e^z + e^{-z}}
Multiplying the numerator and denominator by e^{z} gives:

tanh(z) = \displaystyle\frac{e^{2z} - 1}{e^{2z} + 1}
g(z) = \displaystyle\frac{1}{2}\left(tanh\left(\frac {z}{2}\right) + 1\right)
Substituting \frac{z}{2} for z in the identity above:

g(z) = \displaystyle\frac{1}{2}\left(\frac{e^{z} - 1}{e^{z} + 1} + 1\right)
g(z) = \displaystyle\frac{1}{2}\left(\frac{e^{z} - 1}{e^{z} + 1} + \frac{e^{z} + 1}{e^{z} + 1}\right)
g(z) = \displaystyle\frac{1}{2}\left(\frac{2e^{z} }{e^{z} + 1}\right)
g(z) = \displaystyle\left(\frac{e^{z} }{e^{z} + 1}\right)\left(\frac{e^{-z}}{e^{-z}}\right)
g(z) = \displaystyle\frac{1}{1 + e^{-z}}
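If you’d like to confirm the algebra numerically, here’s a quick sanity check (a small sketch assuming numpy):

```python
import numpy as np

z = np.linspace(-10, 10, 1001)
g = (np.tanh(z / 2) + 1) / 2        # shifted, scaled, input-halved tanh
sigmoid = 1 / (1 + np.exp(-z))      # standard sigmoid

# The largest difference should be on the order of machine epsilon (~1e-16)
print(np.max(np.abs(g - sigmoid)))
```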
So we’re back where we started: let’s just use sigmoid and be happy!