Tanh and sigmoid are closely related

In one of the earlier versions of this course, a student asked why we specifically chose sigmoid as the output layer activation and why tanh couldn’t be used for that purpose. Since the range of tanh is (-1,1) instead of (0,1), we could define “yes” answers to be \hat{y} \geq 0. The question then is what we would use as the loss function, since the log loss (“cross entropy”) function depends on the output values having the range (0,1). Well, one solution would be to shift and scale the output of tanh so that the range becomes (0,1) by using:

g(z) = \displaystyle \frac {tanh(z) + 1}{2}
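To make the loss-function point concrete, here is a minimal sketch using only Python's standard library. The names `log_loss` and `g` and the sample input `z = -0.5` are just illustrative choices, not anything from the course code. It shows that the log loss is undefined for a raw tanh output that is negative, but works once the output is shifted and scaled into (0,1):

```python
import math

def log_loss(y, y_hat):
    """Binary cross-entropy for one example; needs 0 < y_hat < 1."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def g(z):
    """tanh shifted and scaled so its range is (0, 1)."""
    return (math.tanh(z) + 1) / 2

z = -0.5                      # illustrative pre-activation value
raw = math.tanh(z)            # negative, so log(raw) is undefined

try:
    log_loss(1, raw)
    raw_ok = True
except ValueError:            # math.log raises on non-positive input
    raw_ok = False

shifted_loss = log_loss(1, g(z))   # finite, because g(z) lies in (0, 1)
print(raw_ok, shifted_loss)
```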

If you do that, then the slope of g(z) is steeper than sigmoid: at z = 0 it is 1/2, versus 1/4 for sigmoid. Here are the two functions graphed in the domain [-5,5] with sigmoid in blue and g(z) shown in orange:
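For anyone who wants the same comparison as numbers rather than a picture, here is a rough sketch with the standard library only. The helper names (`sigmoid`, `g`, `slope`) are my own, and the slopes are estimated with a simple central difference rather than computed analytically:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def g(z):
    """Shifted and scaled tanh: (tanh(z) + 1) / 2."""
    return (math.tanh(z) + 1.0) / 2.0

def slope(f, z, h=1e-5):
    """Central-difference estimate of f'(z)."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

# Tabulate both functions at integer points of the domain [-5, 5].
for z in range(-5, 6):
    print(f"{z:+d}  sigmoid={sigmoid(z):.4f}  g={g(z):.4f}")

# Compare the slopes at the origin, where both curves are steepest.
slope_sigmoid = slope(sigmoid, 0.0)
slope_g = slope(g, 0.0)
print(slope_sigmoid, slope_g)
```

The table shows g(z) getting close to 0 and 1 much sooner than sigmoid does as |z| grows.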

So the new g(z) will have a worse version of the “vanishing gradient” problem than sigmoid, because it plateaus more aggressively. Well, we can solve that problem by scaling the input value like this:

g(z) = \displaystyle \frac {tanh(\frac{z}{2}) + 1}{2}

But then, guess what? That function is exactly the same as sigmoid. Here’s the derivation:

tanh(z) = \displaystyle\frac{e^z - e^{-z}}{e^z + e^{-z}}
tanh(z) = \displaystyle\frac{e^{2z} - 1}{e^{2z} + 1}
g(z) = \displaystyle\frac{1}{2}\left(tanh\left(\frac {z}{2}\right) + 1\right)
g(z) = \displaystyle\frac{1}{2}\left(\frac{e^{z} - 1}{e^{z} + 1} + 1\right)
g(z) = \displaystyle\frac{1}{2}\left(\frac{e^{z} - 1}{e^{z} + 1} + \frac{e^{z} + 1}{e^{z} + 1}\right)
g(z) = \displaystyle\frac{1}{2}\left(\frac{2e^{z} }{e^{z} + 1}\right)
g(z) = \displaystyle\left(\frac{e^{z} }{e^{z} + 1}\right)\left(\frac{e^{-z}}{e^{-z}}\right)
g(z) = \displaystyle\frac{1}{1 + e^{-z}}
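As a sanity check on the algebra, the identity can also be tested numerically. This is only a sketch: the function names and the grid of test points are arbitrary choices, and agreement up to floating-point rounding is evidence for the derivation, not a proof of it:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def g_half(z):
    """(tanh(z / 2) + 1) / 2, the rescaled version of g."""
    return (math.tanh(z / 2.0) + 1.0) / 2.0

# Compare the two functions on a grid over [-5, 5] in steps of 0.1.
zs = [i / 10.0 for i in range(-50, 51)]
max_diff = max(abs(sigmoid(z) - g_half(z)) for z in zs)
print(max_diff)   # expected to be at the level of rounding error
```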

So we’re back where we started: let’s just use sigmoid and be happy! :nerd_face:


I really like this post :1st_place_medal:

Nice, @paulinpaloalto. And for those who find Gaussian probability descriptions compelling (usually by appeals to central limit theorems), the sigmoid (i.e. the logistic function) is way closer to the normal than the scaled version of the tanh. Your (cool!) chart reveals that the pdf of the scaled tanh has very little mass in the tails, whereas the sigmoid closely approximates the normal (and is numerically easier to work with). So, not appealing on probability grounds either. And I thought today would be boring. :nerd_face: