Significance of the sigmoid in the update gate of an LSTM cell


I also found this was asked here previously: LSTM architecture - #2 by anon57530071

It seems the combination of tanh and sigmoid has been questioned by others, although from a different angle: I’m curious why we need the sigmoid, whereas the others ask why the tanh is needed. Unfortunately, I’m still confused after reading the explanations in those threads.

To clarify where I stand: I understand the intended function of the sigmoid gate, and I realize the two activations have their own weight matrices. I’m also aware that the forget gate involves a similar computation, so my question applies to both gates.

My trouble stems from the fact that multiplying the tanh output by a sigmoid output doesn’t change the output range: the sigmoid lies in (0, 1) and the tanh in (-1, 1), so their elementwise product still lies in (-1, 1), just like the tanh alone. So why couldn’t a tanh function by itself learn to output the same values? I understand that removing the gate would fundamentally change the LSTM, which is why I’m interested in the mathematical significance of the sigmoid, not just its designed purpose.
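To make the observation concrete, here is a minimal NumPy sketch of the computation in question (all variable names, dimensions, and random weights are purely illustrative, not taken from any particular course or implementation). It confirms that gating the tanh candidate with a sigmoid keeps the result inside tanh’s (-1, 1) range:

```python
import numpy as np

# Illustrative small dimensions and random parameters.
rng = np.random.default_rng(0)
n_h, n_x = 4, 3
x_t = rng.standard_normal(n_x)       # current input
a_prev = rng.standard_normal(n_h)    # previous hidden state
concat = np.concatenate([a_prev, x_t])

# Separate weights for the update gate and the candidate value.
W_u = rng.standard_normal((n_h, n_h + n_x)); b_u = np.zeros(n_h)
W_c = rng.standard_normal((n_h, n_h + n_x)); b_c = np.zeros(n_h)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

gamma_u = sigmoid(W_u @ concat + b_u)   # update gate, each entry in (0, 1)
c_tilde = np.tanh(W_c @ concat + b_c)   # candidate value, each entry in (-1, 1)
update = gamma_u * c_tilde              # elementwise product, still in (-1, 1)

# The gating shrinks each candidate entry toward 0 but never leaves (-1, 1).
assert np.all(np.abs(update) < 1)
```

So range-wise the product is indistinguishable from a plain tanh output, which is exactly why I’m asking what the sigmoid adds mathematically.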

  • Would it take many more iterations to train a single tanh to learn to output the same state?
  • Would a single tanh, without being multiplied by a sigmoid output, lack the generalization power of LSTMs and simply overfit the training set?