While both GRU and LSTM use tanh as the activation function for the hidden state, I wonder what makes it a better choice than sigmoid/ReLU/etc. According to these posts, post1 and post2, the main reasons for using tanh are:
- Mitigate the gradient vanishing problem
- The internal state vector's values should be able to both increase and decrease when we add the output of some function to it (tanh is zero-centered, so its output can do either)
Are these the main reasons that make tanh a better choice? A function like leaky ReLU should also provide these features. Is there a more advanced choice that can replace tanh in RNNs (just as we now use ReLU by default in feedforward networks)?
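For concreteness, here is a minimal NumPy sketch of a single GRU step with the candidate-state activation exposed as a parameter, so it's clear which nonlinearity I'm asking about. The weight names, sizes, and random initialization are just illustrative (biases omitted), not any particular library's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, params, candidate_act=np.tanh):
    """One GRU step; candidate_act is the activation in question (tanh by default)."""
    z = sigmoid(params["Wz"] @ x + params["Uz"] @ h_prev)   # update gate
    r = sigmoid(params["Wr"] @ x + params["Ur"] @ h_prev)   # reset gate
    # Candidate state: with tanh this stays in (-1, 1)
    h_cand = candidate_act(params["Wh"] @ x + params["Uh"] @ (r * h_prev))
    # New hidden state: convex mix of old state and candidate, so it stays bounded too
    return (1.0 - z) * h_prev + z * h_cand

# Hypothetical sizes and random weights, just to make the sketch runnable
rng = np.random.default_rng(0)
n_in, n_hid = 4, 8
params = {k: rng.standard_normal((n_hid, n_in if k.startswith("W") else n_hid)) * 0.1
          for k in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}
x, h = rng.standard_normal(n_in), np.zeros(n_hid)

h_tanh = gru_step(x, h, params, candidate_act=np.tanh)
h_relu = gru_step(x, h, params, candidate_act=lambda v: np.maximum(v, 0.0))  # ReLU swap: candidate is unbounded above
print(h_tanh.max(), h_relu.max())
```

With tanh the state is kept in (-1, 1) and can move in either direction at each step; swapping in ReLU (or leaky ReLU) removes the upper bound and makes the candidate non-negative, which is the trade-off I'd like to understand better.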