Prof. Andrew told us that in practice gamma u can be between 0 and 1, but when we look at the formula, a sigmoid function is applied, so how can gamma u be between 0 and 1?
My second question: how does gamma u solve the vanishing gradient problem?
Last question: what kind of relevance does gamma r tell us about?
The sigmoid function (in particular, the logistic function) is the function that “forces” outputs to be in the range from 0 to 1. Before the sigmoid, the values could range from -\infty to \infty, but when passed through the sigmoid function the range becomes (0, 1) (the outputs can get arbitrarily close to 0 and 1 without reaching them exactly).
In simple words, it is the way the hidden state is carried forward that helps with the vanishing gradient problem: the update gate values, ranging over [0, 1], let the network balance between the previous hidden state and the new candidate hidden state. When the gate keeps most of the old hidden state, the gradient can flow back through that almost-unchanged path without shrinking at every step.
For example,
- if the update value is 0, this means that the hidden state for the next step is completely the new candidate hidden state (everything is updated).
- if the update value is 1, this means that the hidden state for the next step is completely the old hidden state (nothing is updated).
- if the update value is 0.5, this means that the hidden state for the next step is half the old and half the new candidate hidden state (equally weighted).
In other words, it makes the math work out as a balance between the candidate hidden state and the old hidden state; the small sketch below shows that blend with made-up numbers.
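Here is a tiny numeric sketch of the blend (the numbers are made up; it uses the same convention as my calculations below, where the update value weights the old hidden state):

```python
# Sketch of how the update gate z blends the old hidden state and the new
# candidate state (convention used here: h_new = (1 - z) * n + z * h_old).
# The numbers are made up, purely for illustration.
h_old = 0.80   # previous hidden state (one element)
n_new = -0.20  # candidate hidden state (one element)

for z in (0.0, 0.5, 1.0):
    h_new = (1 - z) * n_new + z * h_old
    print(f"z = {z:.1f} -> h_new = {h_new:+.2f}")

# z = 0.0 -> h_new = -0.20  (completely the new candidate)
# z = 0.5 -> h_new = +0.30  (half old, half new)
# z = 1.0 -> h_new = +0.80  (completely the old hidden state, nothing updated)
```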
The Reset gate value is used when calculating the new candidate state. It’s not very simple to explain, but I can try: the Reset value tells us how much of the previous (linearly transformed) hidden state we want to remember when constructing the new candidate state.
For me, actual calculations make things easier to understand.
Here is a very simple character-level GRU model (trained), just for illustration (inputs are not embedded, just one-hot vectors). Calculation of steps 33 and 34:
*Note: these are the PyTorch-version calculations of the GRU, and the formulas are not exactly the same as in the course. In particular, PyTorch uses b_ir and b_hr instead of a single b_r, which amounts to the same thing (b_r = b_ir + b_hr).
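If you want to reproduce such numbers yourself, here is a minimal, self-contained sketch (a toy `torch.nn.GRU` with random weights, not the trained character-level model above) that computes one GRU step by hand from PyTorch's stored parameters and checks it against the library. It also shows that b_ir and b_hr only ever appear as a sum, which is why a single b_r = b_ir + b_hr gives the same result:

```python
import torch

torch.manual_seed(0)

input_size, hidden_size = 5, 4                 # toy sizes, just for illustration
gru = torch.nn.GRU(input_size, hidden_size, batch_first=True)

x = torch.randn(1, 1, input_size)              # one time step, batch of 1
h = torch.zeros(1, 1, hidden_size)             # previous hidden state

# PyTorch stacks the gate parameters as (W_ir | W_iz | W_in), etc.
W_ir, W_iz, W_in = gru.weight_ih_l0.chunk(3, dim=0)
W_hr, W_hz, W_hn = gru.weight_hh_l0.chunk(3, dim=0)
b_ir, b_iz, b_in = gru.bias_ih_l0.chunk(3)
b_hr, b_hz, b_hn = gru.bias_hh_l0.chunk(3)

xt, ht = x[0, 0], h[0, 0]

# The gate formulas written out by hand. Note that b_ir and b_hr always
# appear as a sum, so replacing them with b_r = b_ir + b_hr changes nothing.
r = torch.sigmoid(xt @ W_ir.T + b_ir + ht @ W_hr.T + b_hr)     # reset gate
z = torch.sigmoid(xt @ W_iz.T + b_iz + ht @ W_hz.T + b_hz)     # update gate
n = torch.tanh(xt @ W_in.T + b_in + r * (ht @ W_hn.T + b_hn))  # candidate state
h_new = (1 - z) * n + z * ht                                   # new hidden state

_, h_ref = gru(x, h)
print(torch.allclose(h_new, h_ref[0, 0], atol=1e-6))           # True
```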
What you could see from the calculations:
- that when z_34 at index 9 is 0 (the update value), the new hidden state h_34 changes completely to the value of n_34 (-0.12) and does not carry anything over from the previous hidden state h_33 at index 9 (0.19). So h_34 at index 9 becomes (-0.12) for the next step - a completely new hidden state value (a full copy of the candidate state).
- that when z_34 at index 0 is close to 1 (0.94), the new hidden state h_34 retains the value of h_33 (0.84) and becomes (0.82) for the next step - almost unchanged (like in the vanilla RNN case).
- the candidate state calculations are more complex to explain (there is a small reset-gate sketch right after this list):
- first, you calculate r_34 by linearly transforming the input x_34 and linearly transforming the previous hidden state h_33 (with the Reset gate’s own weights, both for the input and for the previous hidden state), summing those values and passing the sum through the sigmoid to get r_34;
- next, you calculate n_34 by linearly transforming the previous hidden state h_33 (with the Candidate gate’s own weights for the previous hidden state), but now multiplying these values element-wise by r_34 and only then summing with the linearly transformed input x_34 (again, with the Candidate gate’s own weights for the input);
- lastly, you apply tanh so that the n_34 values range from -1 to 1, which gives the Candidate hidden state. So some values of n_34, like the one at index 3, become (-0.99) and some, like the one at index 15, become (0.97). And if the update value is close to 0 (as in the latter case, at index 15), the new hidden state will be changed to be close to the candidate state (0.95).
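Here is the small reset-gate sketch I mentioned. It uses made-up values for the two linearly transformed parts and only shows how r scales the hidden-state part inside the candidate (it is not taken from the trained model above):

```python
import torch

# Made-up pieces of a single candidate-state computation, to show what the
# reset gate does inside  n = tanh(input_part + r * hidden_part),
# where input_part ~ W_in @ x + b_in  and  hidden_part ~ W_hn @ h_prev + b_hn.
input_part = torch.tensor([0.40, -1.10])
hidden_part = torch.tensor([2.00, 0.50])

for r_value in (0.0, 0.5, 1.0):
    r = torch.full_like(hidden_part, r_value)
    n = torch.tanh(input_part + r * hidden_part)
    print(f"r = {r_value:.1f} -> candidate n = {n}")

# r = 0.0 -> the previous hidden state is ignored: n = tanh(input_part)
# r = 1.0 -> the previous hidden state is fully used in the candidate
```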
In general terms:
- the Reset gate controls how much of the previous state to retain (for candidate state calculations);
- the Update gate controls the balance between the new candidate state and the old hidden state;
- the Candidate state is our way of expressing (calculating) what the new hidden state could be; the calculations involve the current input, the previous hidden state and the Reset gate values;
Cheers
hi @arvyzukai ,
The equation for the new hidden state is given by:
h<t> = Γ_u * c<t> + (1 - Γ_u) * h<t-1>
If Γ_u = 0, then:
h<t> = h<t-1>
If Γ_u = 1, then:
h<t> = c<t>
But from the points explained above, it seems to be completely the opposite:
- if the update value is 0, this means that the hidden state for the next step is completely the new candidate hidden state (everything is updated).
- if the update value is 1, this means that the hidden state for the next step is completely the old hidden state (nothing is updated).
Am I doing something wrong here? Please help me understand this better.
Also, what is the intuition behind using tanh while calculating the candidate state? How is it better than ReLU or any other mathematical function?
I suppose there might be a bit of confusion in your understanding.
Read through the comment, which explains that when gamma u is 0, it is the initial value that gets updated, not the new one.
Also, tanh provides a nonlinear response, which can capture higher-level patterns or features in the input data when it comes to training a neural network.
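A tiny illustration of one practical difference (not from the course, just a quick check): tanh keeps the candidate values bounded in (-1, 1), so the hidden state cannot blow up across time steps, whereas ReLU is unbounded above:

```python
import torch

pre_activation = torch.tensor([-5.0, -1.0, 0.0, 1.0, 5.0, 50.0])

print(torch.tanh(pre_activation))   # every value squashed into (-1, 1)
print(torch.relu(pre_activation))   # 0, 0, 0, 1, 5, 50 -- no upper bound
```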
Regards
DP
Hi @Deepti_Prasad ,
your explanation here:
matches my observation:
But it was stated the other way around in @arvyzukai’s comments and in the results he shared:
That’s what I wanted to check: whether I was getting it wrong, or whether the equations have a typo in them. I still have doubts about this.
The equations are taken from here: https://www.coursera.org/learn/sequence-models-in-nlp/supplement/t5L3H/gated-recurrent-units
Hi @Deepti_Prasad @lucas.coutinho, any feedback or suggestions on the above message?
Actually, I have checked, and what @arvyzukai states also seems to be correct according to what I found, but it does differ from the explanation in Andrew Ng’s DLS video; perhaps that is because in NLP it works differently.
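For what it’s worth, here is a minimal numeric check (made-up numbers, a single hidden unit) that the two formulas describe the same blend, just with the gate playing opposite roles: the course’s Γ_u weights the candidate, while PyTorch’s z weights the old hidden state, i.e. Γ_u behaves like 1 - z:

```python
# Course notation:  h_new = gamma_u * c + (1 - gamma_u) * h_prev
# PyTorch notation: h_new = (1 - z) * n + z * h_prev
h_prev, candidate = 0.80, -0.20

for gamma_u in (0.0, 0.5, 1.0):
    z = 1 - gamma_u                                   # the corresponding PyTorch gate value
    course = gamma_u * candidate + (1 - gamma_u) * h_prev
    pytorch = (1 - z) * candidate + z * h_prev
    print(f"gamma_u = {gamma_u:.1f} (z = {z:.1f}) -> course = {course:+.2f}, pytorch = {pytorch:+.2f}")

# gamma_u = 0.0 (z = 1.0) -> +0.80  (old hidden state kept)
# gamma_u = 1.0 (z = 0.0) -> -0.20  (candidate taken)
```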
Hey @arvyzukai, could you please confirm the learner’s doubt, as @lucas.coutinho is busy testing other courses. I suppose you are busy too; I don’t see you much these days.
regards
DP
Hi @Deepti_Prasad @arvyzukai ,
I didn’t get the part about why things might be implemented differently in the library.
Does it provide better overall performance, or what might be the reason? I thought theory and implementation must be in sync.