Thanks!

I also found this was asked here previously: LSTM architecture - #2 by anon57530071

It seems the combination of **tanh** and **sigmoid** has been questioned by others, although from a different angle: I was curious why we need the sigmoid, whereas others question why the tanh is needed. Unfortunately, I’m still confused after reading the explanations in those threads.

To clarify where I stand: I understand the fundamental function of the sigmoid gate and realize that each activation has its own weights. I’m also aware that the forget gate involves a similar computation, so my question applies to it as well.

My trouble stems from the fact that multiplying the **tanh** output by a **sigmoid** output doesn’t change the output range of the initial **tanh** function. So why couldn’t a **tanh** function alone learn to output the same values? I understand that would fundamentally change the LSTM, which is why I’m interested in the mathematical significance of the sigmoid rather than its designed purpose.
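To make the range argument concrete, here is a minimal numpy sketch of the output-gate step (the variable names and shapes are my own illustration, not from any particular library): the sigmoid gate lies in (0, 1), so multiplying it with tanh(c) keeps the result inside tanh’s (-1, 1) range.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative LSTM output-gate step (names/shapes are assumptions).
rng = np.random.default_rng(0)
c = rng.standard_normal(4)           # cell state
x = rng.standard_normal(3)           # input at this time step
W_o = rng.standard_normal((4, 3))    # output-gate weights (bias omitted)

o = sigmoid(W_o @ x)                 # gate values in (0, 1)
h = o * np.tanh(c)                   # hidden state: still within (-1, 1)

# The sigmoid factor rescales but never expands the tanh range.
assert np.all(np.abs(h) < 1)
```

This only demonstrates that the composition stays inside (-1, 1); the question is whether a single tanh could learn to produce the same values without the multiplicative gate.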

- Would it take many more iterations to train a single **tanh** to learn to output the same state?
- Would a single **tanh**, without being multiplied by a **sigmoid** output, lack the abstract generalization power of LSTMs and end up just overfitting the training set?