Why is tanh used in RNNs to compute the hidden state?

While both GRU and LSTM use tanh as the activation function for the hidden state, I wonder what makes it a better choice than sigmoid, ReLU, etc. According to these posts (post1, post2), the main reasons for using tanh are:

  1. It mitigates the vanishing gradient problem
  2. The internal state vector's values should be able to both increase and decrease when we add the output of some function to them

Are these the main reasons that make tanh a better choice? A function like leaky ReLU should also provide these features. Is there any more advanced choice that can replace tanh in RNNs (just as ReLU has become the default in feedforward networks)?
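
For context, here is a minimal sketch of the vanilla RNN hidden-state update I have in mind, where tanh squashes the new hidden state into (-1, 1); the names `rnn_step`, `W_h`, `W_x`, and `b` are just my own placeholders:

```python
import numpy as np

def rnn_step(h_prev, x, W_h, W_x, b):
    # Vanilla RNN update: tanh keeps every entry of the new hidden state in (-1, 1)
    return np.tanh(W_h @ h_prev + W_x @ x + b)

# Toy sizes (hypothetical): hidden size 4, input size 3
rng = np.random.default_rng(0)
W_h = rng.normal(size=(4, 4))
W_x = rng.normal(size=(4, 3))
b = np.zeros(4)

h = np.zeros(4)
for x in rng.normal(size=(5, 3)):  # a short input sequence
    h = rnn_step(h, x, W_h, W_x, b)
print(h)  # entries stay bounded in (-1, 1) no matter how long the sequence is
```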

tanh does not mitigate vanishing gradients: its gradient also approaches zero for large-magnitude inputs.
tanh has limiting values of +1 and -1, so it works well for outputs that should be symmetric around zero, such as a hidden state whose values need to move in either direction.
The logistic sigmoid has limits of 0 and 1, so it tends to be used for classification-like outputs (0 = false, 1 = true), which is why LSTM and GRU use it for their gates rather than for the candidate state.
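
A quick numerical check (my own sketch, not from the linked posts) makes both points concrete: tanh saturates just like the sigmoid does, but its outputs are symmetric around zero while the sigmoid's are confined to (0, 1):

```python
import numpy as np

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])

tanh = np.tanh(x)
dtanh = 1.0 - tanh**2                  # derivative of tanh
sigmoid = 1.0 / (1.0 + np.exp(-x))
dsigmoid = sigmoid * (1.0 - sigmoid)   # derivative of the logistic sigmoid

print("tanh:         ", tanh)      # symmetric around 0, bounded in (-1, 1)
print("tanh grad:    ", dtanh)     # ~0 for large |x|, so it still saturates
print("sigmoid:      ", sigmoid)   # bounded in (0, 1), not symmetric around 0
print("sigmoid grad: ", dsigmoid)  # peaks at 0.25 at x = 0, also ~0 for large |x|
```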