Gated Recurrent Unit gates

Prof. Andrew told us that in practice gamma u can be between 0 and 1, but when we look at the formula, a sigmoid function is applied, so how can gamma u be between 0 and 1?
My second question is: how does gamma u solve the vanishing gradient problem?
Last question: what type of relevance does gamma r tell us?

Hi @Usama_Ahmed1

The sigmoid function (in particular, the logistic function) is the function that “forces” outputs to be in the range between 0 and 1. Before the sigmoid, the values can range from -\infty to \infty, but after being passed through the sigmoid function they lie in the range (0, 1).
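
To make this concrete, here is a minimal sketch in plain Python (the input values are made up for illustration). It shows that any real pre-activation ends up between 0 and 1, and that large positive or negative inputs land very close to 1 or 0, which is why a gate often behaves almost like an on/off switch.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Pre-activation values can be any real number (made-up examples)...
pre_activations = [-10.0, -1.0, 0.0, 1.0, 10.0]

# ...but the gate values after the sigmoid always lie between 0 and 1.
gate_values = [sigmoid(a) for a in pre_activations]

for a, g in zip(pre_activations, gate_values):
    print(f"{a:6.1f} -> {g:.5f}")
```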

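In simple words, carrying the hidden state forward is what helps with the vanishing gradient problem: when the update gate keeps most of the old hidden state, information (and gradient) can pass through many steps almost unchanged. The update gate values, ranging from 0 to 1, set the balance between the previous hidden state and the new candidate hidden state.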
For example,

  • if the update value is 0, the hidden state for the next step is completely the new candidate hidden state (everything is updated).
  • if the update value is 1, the hidden state for the next step is completely the old hidden state (nothing is updated).
  • if the update value is 0.5, the hidden state for the next step is half old hidden state and half new candidate hidden state (equally weighted).

In other words, the update value is the weight in a weighted average of the old hidden state and the candidate hidden state.
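
The three cases above can be sketched in a few lines of Python. This is only an illustration for a single hidden unit: the function name `blend` and the numbers are made up, and it follows the convention used in this answer (an update value of 1 keeps the old state). If you treat the gate and the candidate as fixed, the derivative of the new state with respect to the old state is just the update value, so a value near 1 passes the gradient back almost undiminished.

```python
def blend(z: float, h_old: float, n_new: float) -> float:
    """New hidden state: z weights the old state, (1 - z) weights the candidate."""
    return z * h_old + (1.0 - z) * n_new

# Made-up values for one hidden unit.
h_old, n_new = 0.8, -0.4

for z in (0.0, 0.5, 1.0):
    # z = 0.0 -> candidate only, z = 1.0 -> old state only, z = 0.5 -> equal mix
    print(f"z = {z}: new hidden state = {blend(z, h_old, n_new)}")
```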

The reset gate value is used when calculating the new candidate state. It’s not very simple to explain, but I can try. The reset value tells us how much of the previous (linearly transformed) hidden state we want to remember when constructing the new candidate state.
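
A tiny sketch of that idea for a single hidden unit, assuming the input and the previous hidden state have already been linearly transformed (the names `x_term` and `h_term` and the numbers are hypothetical): the reset value scales only the hidden-state part before the tanh.

```python
import math

def candidate(x_term: float, h_term: float, r: float) -> float:
    """Candidate state: the reset value r scales the transformed previous state."""
    return math.tanh(x_term + r * h_term)

# Made-up, already-transformed values.
x_term, h_term = 0.5, 2.0

for r in (0.0, 0.5, 1.0):
    # r = 0.0 ignores the previous state; r = 1.0 uses it fully.
    print(f"r = {r}: candidate = {candidate(x_term, h_term, r):.4f}")
```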


For me, actual calculations make things easier to understand.
Here is a very simple character-level GRU model (trained), just for illustration (inputs are not embedded, just one-hot vectors). Calculation of steps 33 and 34:

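*Note: these calculations follow the PyTorch version of the GRU, and the formulas are not exactly the same as in the Course. In particular, PyTorch uses b_ir and b_hr instead of b_r, which in the end is the same thing (b_r = b_ir + b_hr). Also, PyTorch's update gate z works the other way round from the Course's gamma u: here z = 1 keeps the old hidden state, while in the Course gamma u = 1 replaces it with the candidate.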
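
For reference, these are the GRU formulas as written in the PyTorch documentation, where \sigma is the sigmoid and \odot is element-wise multiplication (r is the reset gate, z the update gate, n the candidate state):

$$
\begin{aligned}
r_t &= \sigma(W_{ir} x_t + b_{ir} + W_{hr} h_{t-1} + b_{hr}) \\
z_t &= \sigma(W_{iz} x_t + b_{iz} + W_{hz} h_{t-1} + b_{hz}) \\
n_t &= \tanh(W_{in} x_t + b_{in} + r_t \odot (W_{hn} h_{t-1} + b_{hn})) \\
h_t &= (1 - z_t) \odot n_t + z_t \odot h_{t-1}
\end{aligned}
$$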

What you can see from the calculations:

  • that when z_34 at index 9 is 0 (the update value), the new hidden state h_34 takes the value of n_34 (-0.12) completely and does not carry anything over from the previous hidden state h_33 at index 9 (0.19). So h_34 at index 9 becomes (-0.12) for the next step - a completely new hidden state value (a full copy of the candidate state).
  • that when z_34 at index 0 is close to 1 (0.94), the new hidden state h_34 retains the value of h_33 (0.84) and becomes (0.82) for the next step - almost unchanged (like in vanilla RNN case).
  • the candidate state calculations are more complex to explain:
    • first, you calculate r_34 by linearly transforming the input x_34 and the previous hidden state h_33 (with weights specific to the Reset gate, one set for the input and one for the previous hidden state), summing those values and applying the sigmoid to get r_34.
    • next, you calculate n_34 by linearly transforming the previous hidden state h_33 (with the hidden-state weights of the Candidate gate), multiplying these values element-wise by r_34, and only then adding the linearly transformed input x_34 (again, with the input weights of the Candidate gate);
    • lastly, you apply tanh so that the n_34 values range from -1 to 1, which gives the Candidate hidden state. So some values of n_34, like at index 3, become (-0.99) and some, like at index 15, become (0.97). And if the update value is close to 0 (as in the latter case, at index 15), the new hidden state will be close to the candidate state (0.95).
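
If you want to redo this kind of calculation yourself, here is a minimal sketch of one GRU step in plain Python, following the PyTorch-style formulas. Everything in it is an assumption for illustration: the weights in `params` are made up (not the trained model's weights), the sizes are tiny (3 inputs, 2 hidden units), and the update-gate biases are chosen so that unit 0 gets an update value near 0 and unit 1 an update value near 1.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def linear(W, v, b):
    """Return W @ v + b for a list-of-rows matrix W and vectors v, b."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def gru_step(x, h_prev, p):
    """One GRU step; p is a dict of (hypothetical) weights and biases."""
    # Reset gate: sigmoid of (transformed input + transformed previous state).
    r = [sigmoid(a + c) for a, c in zip(linear(p["W_ir"], x, p["b_ir"]),
                                        linear(p["W_hr"], h_prev, p["b_hr"]))]
    # Update gate: same form, with its own weights.
    z = [sigmoid(a + c) for a, c in zip(linear(p["W_iz"], x, p["b_iz"]),
                                        linear(p["W_hz"], h_prev, p["b_hz"]))]
    # Candidate: the reset gate scales the transformed previous state
    # before it is added to the transformed input; tanh bounds the result.
    xn = linear(p["W_in"], x, p["b_in"])
    hn = linear(p["W_hn"], h_prev, p["b_hn"])
    n = [math.tanh(a + ri * c) for a, ri, c in zip(xn, r, hn)]
    # New state: z keeps the old state, (1 - z) lets the candidate in.
    h = [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h_prev)]
    return r, z, n, h

# Made-up parameters: 3 inputs (one-hot), 2 hidden units.
params = {
    "W_ir": [[0.5, -0.3, 0.1], [0.2, 0.4, -0.6]], "b_ir": [0.0, 0.0],
    "W_hr": [[0.3, -0.2], [0.1, 0.7]],            "b_hr": [0.0, 0.0],
    "W_iz": [[0.1, 0.2, 0.3], [-0.1, 0.0, 0.2]],  "b_iz": [-20.0, 20.0],
    "W_hz": [[0.2, 0.1], [-0.3, 0.4]],            "b_hz": [0.0, 0.0],
    "W_in": [[0.9, -0.5, 0.3], [-0.4, 0.8, 0.6]], "b_in": [0.0, 0.0],
    "W_hn": [[0.6, -0.1], [0.2, 0.5]],            "b_hn": [0.0, 0.0],
}

x = [0.0, 1.0, 0.0]      # one-hot input
h_prev = [0.19, 0.84]    # previous hidden state

r, z, n, h = gru_step(x, h_prev, params)
print("reset     r:", r)
print("update    z:", z)
print("candidate n:", n)
print("new state h:", h)
```

With these made-up biases, unit 0 ends up essentially equal to its candidate value and unit 1 essentially keeps its previous value, which mirrors the two cases at index 9 and index 0 above.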

In general terms:

  • the Reset gate controls how much of the previous state to retain (for candidate state calculations);
  • the Update gate controls the balance between the new candidate state and the old hidden state;
  • the Candidate state is our way of expressing (calculating) what the new hidden state could be; the calculations involve the current input, the previous hidden state and the Reset gate values.

Cheers