In one of the earlier versions of this course, I remember a student asking why we specifically chose sigmoid as the output-layer activation and whether tanh could be used for that purpose instead. Since the range of tanh is (-1, 1) instead of (0, 1), we could define “yes” answers to be \hat{y} \geq 0. The question then becomes what to use as the loss function, since the log loss (‘cross entropy’) depends on the output values lying in the range (0, 1). Well, one solution would be to shift and scale the output of tanh so that its range becomes (0, 1) by using:
g(z) = \displaystyle \frac {tanh(z) + 1}{2}
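For reference, the log loss for a single example with label y \in \{0, 1\} is:

L(\hat{y}, y) = \displaystyle -\left(y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})\right)

Both logarithms are only defined when \hat{y} is strictly between 0 and 1, which is why the output activation needs that range.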
If you use this new g(z), its slope is steeper than sigmoid’s (exactly twice as steep at z = 0). Here are the two functions graphed over the domain [-5, 5], with sigmoid in blue and g(z) shown in orange:
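If you want to reproduce a similar plot yourself, here’s a minimal sketch using numpy and matplotlib (illustrative only, not the code behind the figure above):

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def g(z):
    # Shifted and scaled tanh: maps the range (-1, 1) onto (0, 1)
    return (np.tanh(z) + 1.0) / 2.0

z = np.linspace(-5, 5, 500)
plt.plot(z, sigmoid(z), color="blue", label="sigmoid(z)")
plt.plot(z, g(z), color="orange", label="(tanh(z) + 1) / 2")
plt.xlabel("z")
plt.legend()
plt.show()
```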
So the new g(z) will have a worse version of the “vanishing gradient” problem than sigmoid, because it plateaus more aggressively. Well, we can solve that problem by scaling the input value like this:
g(z) = \displaystyle \frac {tanh(\frac{z}{2}) + 1}{2}
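To see why halving the input fixes the steepness, compare slopes at z = 0. The derivative of tanh is 1 - tanh^2(z), so the first version has slope

\displaystyle \frac{d}{dz}\left(\frac{tanh(z) + 1}{2}\right)\Bigg|_{z=0} = \frac{1 - tanh^2(0)}{2} = \frac{1}{2}

which is twice sigmoid’s slope of \frac{1}{4} at 0. The \frac{z}{2} inside tanh contributes an extra factor of \frac{1}{2} through the chain rule, bringing the slope at 0 back down to \frac{1}{4}.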
But then, guess what? That function is exactly the same as sigmoid. Here’s the derivation:
tanh(z) = \displaystyle\frac{e^z - e^{-z}}{e^z + e^{-z}}
Multiplying the numerator and denominator by e^{z} gives:

tanh(z) = \displaystyle\frac{e^{2z} - 1}{e^{2z} + 1}
g(z) = \displaystyle\frac{1}{2}\left(tanh\left(\frac {z}{2}\right) + 1\right)
Substituting \frac{z}{2} for z in the identity above:

g(z) = \displaystyle\frac{1}{2}\left(\frac{e^{z} - 1}{e^{z} + 1} + 1\right)
g(z) = \displaystyle\frac{1}{2}\left(\frac{e^{z} - 1}{e^{z} + 1} + \frac{e^{z} + 1}{e^{z} + 1}\right)
g(z) = \displaystyle\frac{1}{2}\left(\frac{2e^{z} }{e^{z} + 1}\right)
g(z) = \displaystyle\left(\frac{e^{z} }{e^{z} + 1}\right)\left(\frac{e^{-z}}{e^{-z}}\right)
g(z) = \displaystyle\frac{1}{1 + e^{-z}}
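If you’d like to confirm the algebra numerically, here’s a quick sanity check (a small sketch assuming numpy):

```python
import numpy as np

z = np.linspace(-10, 10, 1001)
g = (np.tanh(z / 2) + 1) / 2        # shifted, scaled, input-halved tanh
sigmoid = 1 / (1 + np.exp(-z))      # standard sigmoid

# The largest difference should be on the order of machine epsilon (~1e-16)
print(np.max(np.abs(g - sigmoid)))
```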
So we’re back where we started: let’s just use sigmoid and be happy!