While both GRU and LSTM use tanh as the activation function for the hidden state, I wonder what makes it a better choice than sigmoid/ReLU/etc. According to these posts, post1 and post2, the main reasons for using tanh are:
- Mitigate the gradient vanishing problem
- The internal state vector's values should be able to both increase and decrease when we add the output of some function to it (tanh is zero-centered, so its output can do either)
Are these the main reasons that make tanh a better choice? A function like leaky ReLU should also provide these features. Is there a more advanced choice that can replace tanh in RNNs (just as we now use ReLU by default in feedforward networks)?
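For concreteness, here is a minimal NumPy sketch of a single GRU step with the candidate-state activation exposed as a parameter, so it's clear which nonlinearity I'm asking about. The weight names, sizes, and random initialization are just illustrative (biases omitted), not any particular library's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, params, candidate_act=np.tanh):
    """One GRU step; candidate_act is the activation in question (tanh by default)."""
    z = sigmoid(params["Wz"] @ x + params["Uz"] @ h_prev)   # update gate
    r = sigmoid(params["Wr"] @ x + params["Ur"] @ h_prev)   # reset gate
    # Candidate state: with tanh this stays in (-1, 1)
    h_cand = candidate_act(params["Wh"] @ x + params["Uh"] @ (r * h_prev))
    # New hidden state: convex mix of old state and candidate, so it stays bounded too
    return (1.0 - z) * h_prev + z * h_cand

# Hypothetical sizes and random weights, just to make the sketch runnable
rng = np.random.default_rng(0)
n_in, n_hid = 4, 8
params = {k: rng.standard_normal((n_hid, n_in if k.startswith("W") else n_hid)) * 0.1
          for k in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}
x, h = rng.standard_normal(n_in), np.zeros(n_hid)

h_tanh = gru_step(x, h, params, candidate_act=np.tanh)
h_relu = gru_step(x, h, params, candidate_act=lambda v: np.maximum(v, 0.0))  # ReLU swap: candidate is unbounded above
print(h_tanh.max(), h_relu.max())
```

With tanh the state is kept in (-1, 1) and can move in either direction at each step; swapping in ReLU (or leaky ReLU) removes the upper bound and makes the candidate non-negative, which is the trade-off I'd like to understand better.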