Significance of sigmoid in an update gate of LSTM cell

Looking at the update gate of an LSTM cell, I cannot grasp the reason for its existence in the first place. I have read several explanations from different sources, and they all boil down to the sigmoid output being the filter/multiplier for the tanh output. This is fine from an intuitive standpoint; however, it doesn’t explain why a sole tanh activation wouldn’t produce the same output.

In other words, why do we have a tanh activation multiplied by sigmoid activation, and not just tanh activation? Is it easier for tanh * sigmoid to learn than a single tanh?

Hi @gokturk.gezer

You might want to take a look at this thread. In short:

The tanh mask, with its -1 to +1 outputs, determines whether to decrement or increment items in the cell state. The sigmoid mask determines whether an item should be updated at all, similar to the forget gate.
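To make the two roles concrete, here is a tiny NumPy sketch of the update-gate part of the cell-state update. The variable names and pre-activation values are purely illustrative, not from any particular library:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical pre-activations for one time step (illustrative values).
z_input = np.array([-4.0, 0.5, 6.0])   # pre-activation of the input/update gate
z_cand  = np.array([ 2.0, -3.0, 1.0])  # pre-activation of the candidate values

i_t = sigmoid(z_input)  # "should this item be updated at all?" -> range (0, 1)
g_t = np.tanh(z_cand)   # "in which direction, and how far?"   -> range (-1, 1)

update = i_t * g_t      # what actually gets added to the cell state
```

Because `i_t` stays in (0, 1), the product can only shrink (or pass through) the candidate values, never flip their sign; the sign comes from `g_t` alone.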



I also found this was asked here previously: LSTM architecture - #2 by anon57530071

It seems the combination of tanh and sigmoid is questioned by others, although from a different angle. I was curious why we need sigmoid, whereas the others question why tanh is needed. Unfortunately, I’m still confused after reading explanations on these threads.

To clarify where I stand, I understand the fundamental function of the sigmoid gate and realize that both activations have different weights. I’m also aware that the forget gate has a similar computation, so my question stands for both.

My trouble stems from the fact that multiplying tanh output with a sigmoid output doesn’t change the output range of the initial tanh function. So why couldn’t a tanh function alone learn to output the same? I understand that would fundamentally change the LSTM so that’s why I’m interested in understanding the mathematical significance of sigmoid, and not its designed purpose.

  • Would it take many more iterations to train a single tanh to learn to output the same state?
  • Would a single tanh, without being multiplied by a sigmoid output, not exhibit the abstract generalization power of LSTMs and end up just overfitting the training set?

Let me offer an overly simplistic analogy:

  • sigmoid - like a normal door - open or closed (1 or 0) - you can go through at the same speed or stop;
  • tanh - like a revolving door - change direction or do not change direction (-1 or 1) - you can go through at the same speed or go backwards at the same speed;

It is “hard” for sigmoid outputs to linger at 0.5, and it is “hard” for tanh outputs to linger at 0; values tend to saturate toward the extremes, hence the different properties of these functions that are useful.
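A quick NumPy check of this saturation behavior, evaluating both functions on a few hypothetical pre-activation values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative pre-activations, from strongly negative to strongly positive.
z = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])

s = sigmoid(z)   # saturates toward 0 or 1; exactly 0.5 only at z = 0
t = np.tanh(z)   # saturates toward -1 or +1; exactly 0 only at z = 0
```

Even modest pre-activations like ±2 already push sigmoid close to its 0/1 extremes and tanh close to ±1, which is why the gates behave like (mostly) open or closed doors in practice.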

I’m not a big fan of ML community names, so below I provide some very simple calculations to make things concrete.

If we take the forget gate (and the “long” memory C_t), then sigmoid is the better choice, since we either want to forget or not forget (0 or 1; closed or open door).
For example, if at the current time step t the combination of input (X_t) and hidden state (H_t) “tells” us that we need to “forget” some values, we use sigmoid. As an overly simple concrete example, at step 3, if F_3 is [-20, -15, 10, 20], then after sigmoid we get approximately [0, 0, 1, 1]. If our C_2 was [-0.7, -0.11, -0.07, 1.06], we would continue calculations with [0, 0, -0.07, 1.06] instead of the whole C_2.
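The forget-gate example above can be reproduced in a couple of lines of NumPy (same illustrative numbers as in the text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

F_3 = np.array([-20.0, -15.0, 10.0, 20.0])  # forget-gate pre-activation at step 3
C_2 = np.array([-0.7, -0.11, -0.07, 1.06])  # previous cell state

gate   = sigmoid(F_3)   # ~[0, 0, 1, 1]: which items to keep
C_kept = gate * C_2     # ~[0, 0, -0.07, 1.06]: the surviving cell state
```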

If we take the output gate (used for “short” memory or hidden state H_t), then we use the combination of the two:
Whenever the output gate is close to 1, we allow the “long memory” (C_t) to impact the subsequent layers uninhibited, whereas for output gate values close to 0, we prevent the current memory from impacting other layers of the network at the current time step. Note that a memory cell can accrue information across many time steps without impacting the rest of the network (so long as the output gate takes values close to 0), and then suddenly impact the network at a subsequent time step as soon as the output gate flips from values close to 0 to values close to 1.
Note that tanh is the better choice for the “long” memory interaction, since it allows for a more “interesting” multiplication of the two (the subsequent hidden state can have values from -1 to 1, instead of from 0 to 1).
Continuing the overly simple concrete example, let’s calculate H_3 and assume the output gate (O_3) is, for example, [0, 1, 0, 1]. By now C_3 has accumulated values from the input gate and can contain a variety of values (not just from 0 to 1), so C_3 could have become [-35, -48, -35, 12]. Applying tanh results in [-1, -1, -1, 1], while sigmoid would have squashed the result more (to [0, 0, 0, 1]). The resulting vector can therefore have a more “interesting” interaction with the output gate, with the resulting H_3 being [0, -1, 0, 1] (instead of the sigmoid version’s [0, 0, 0, 1]).
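The same output-gate comparison in NumPy, using the illustrative numbers from the text, shows the difference between squashing C_3 with tanh versus sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

O_3 = np.array([0.0, 1.0, 0.0, 1.0])         # output gate (already after sigmoid)
C_3 = np.array([-35.0, -48.0, -35.0, 12.0])  # cell state after the input-gate update

H_3_tanh    = O_3 * np.tanh(C_3)   # ~[0, -1, 0, 1]: sign information survives
H_3_sigmoid = O_3 * sigmoid(C_3)   # ~[0,  0, 0, 1]: negative values are lost
```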

In reality values do not go that extreme and allow a “flow” of information instead of just closed or open, or flipped or not flipped.

So in the end to answer your questions concretely:

On whether a single tanh would take many more iterations to learn the same state: most probably, or it might even never learn it.

I’m not sure I understand how overfitting is related here: the parameter count and your dataset are the main driving forces behind overfitting, not the activation functions. Or am I missing some relation here?


Thank you @arvyzukai.
This makes sense. It especially helped to visualize that tanh and sigmoid each move away from certain values faster than others. Additionally, the activation could be something other than tanh, so having a sigmoid gate makes sense as a generic way to keep or throw away certain states.

As for my comment about overfitting, I think it contradicts itself. I believe that when you throw away the sigmoid, you’d be reducing the parameter count of the model, hence decreasing the chances of overfitting.


I’m happy to help @gokturk.gezer because these are good questions :+1:

I want to clarify this point - “I believe when you throw away the sigmoid, you’d be reducing the parameter count of the model” - that is not true. Changing or removing an activation function does not change the parameter count: the activation function is applied element-wise to the outputs of the network’s layers and does not affect the number of weights or biases.
Changing the activation function may affect the behavior of the model, potentially leading to different learning dynamics and performance. However, it does not alter the number of weights or biases and therefore does not directly impact the parameter count of the model.
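A small sketch to illustrate the point. A single gate computes activation(W·x + U·h + b); the sizes of W, U, and b, and hence the parameter count, are fixed before any activation is chosen (the dimensions below are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 4, 3  # illustrative sizes

# One gate's parameters; these exist regardless of the activation used.
W = rng.normal(size=(n_hidden, n_in))
U = rng.normal(size=(n_hidden, n_hidden))
b = np.zeros(n_hidden)

def gate(x, h, activation):
    # The activation is applied element-wise to the same affine output.
    return activation(W @ x + U @ h + b)

x = rng.normal(size=n_in)
h = np.zeros(n_hidden)

out_tanh = gate(x, h, np.tanh)
out_sig  = gate(x, h, lambda z: 1.0 / (1.0 + np.exp(-z)))

n_params = W.size + U.size + b.size  # identical for either activation
```

Swapping `np.tanh` for a sigmoid changes the outputs but touches none of W, U, or b, so `n_params` is the same either way.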


When converting tanh * sigmoid to a single tanh, wouldn’t we get rid of the weights associated with the sigmoid gate and thus reduce the internal parameter count the model needs to learn?