Prof. Andrew told us that in practice gamma_u can be between 0 and 1, but when we look at the formula, a sigmoid function is applied, so how can gamma_u be between 0 and 1?

My second question is: how does gamma_u solve the vanishing gradient problem?

Last question: what type of relevance does gamma_r tell us?

The sigmoid function (in particular, the logistic function) is **the** function that “forces” outputs into the range between 0 and 1. Before the sigmoid, the values could range from -∞ to ∞, but after passing through the sigmoid the range becomes (0, 1) (the endpoints are approached but never reached exactly).

In simple words, the *hidden* state is what helps solve the vanishing gradient problem, and the *update* gate values, ranging over (0, 1), help find the balance between the *previous* hidden state and the *new candidate* hidden state.

For example,

- if the update value is 0, the hidden state for the next step is completely the **new** candidate hidden state (everything is updated);
- if the update value is 1, the hidden state for the next step is completely the **old** hidden state (nothing is updated);
- if the update value is 0.5, the hidden state for the next step is **half** *old* and **half** *new candidate* hidden state (equally weighted).

In other words, it helps the math to work and **balance** between candidate hidden state and old hidden state.
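This balancing act is just a weighted average. A minimal sketch, using the PyTorch convention discussed below (z = 1 keeps the old state, z = 0 takes the candidate), with made-up scalar values:

```python
def new_hidden(z, h_prev, n):
    # PyTorch GRU convention: h_t = (1 - z) * n + z * h_prev
    # z = 1 keeps the old state, z = 0 takes the new candidate
    return (1.0 - z) * n + z * h_prev

h_prev, n = 0.8, -0.2
print(new_hidden(0.0, h_prev, n))  # fully the new candidate (-0.2)
print(new_hidden(1.0, h_prev, n))  # fully the old state (0.8)
print(new_hidden(0.5, h_prev, n))  # an equal mix (≈ 0.3)
```

Because z multiplies h_prev directly, a z near 1 lets the old state pass through almost untouched, which is exactly what keeps gradients from vanishing over long sequences.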

The *Reset* gate value is used when calculating the *new candidate* state. It’s not very simple to explain, but I can try: the *Reset* value tells us how much of the previous (linearly transformed) hidden state we want to remember **when constructing** the *new* candidate state.
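Here is a minimal NumPy sketch of that idea, with toy random weights and zero biases (my own illustrative setup, not the trained model below); it shows how the reset gate r scales the transformed previous hidden state inside the candidate calculation:

```python
import numpy as np

rng = np.random.default_rng(0)
H, X = 3, 2                              # toy sizes, just for illustration
W_in = rng.normal(size=(H, X))           # input -> candidate weights
W_hn = rng.normal(size=(H, H))           # hidden -> candidate weights
b_in, b_hn = np.zeros(H), np.zeros(H)
x, h_prev = rng.normal(size=X), rng.normal(size=H)

def candidate(r):
    # the reset gate r scales the (linearly transformed) previous hidden state
    # before it is summed with the transformed input
    return np.tanh(W_in @ x + b_in + r * (W_hn @ h_prev + b_hn))

print(candidate(np.ones(H)))   # r = 1: full memory of the previous state
print(candidate(np.zeros(H)))  # r = 0: candidate ignores the previous state entirely
```

With r = 0 the candidate depends only on the current input, so the reset gate really does decide how relevant the past is when proposing a new state.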

For me, actual calculations make things easier to understand.

Here is a very simple character-level GRU model (trained), just for illustration (inputs are not embedded, just one-hot vectors). Calculation of steps 33 and 34:

*Note: these are PyTorch-style GRU calculations, and the formulas are not exactly the same as in the course. In particular, PyTorch uses b_ir and b_hr instead of b_r, which is eventually the same thing (b_r = b_ir + b_hr).*
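A tiny check of that bias equivalence, with made-up numbers standing in for the pre-activation W_ir @ x + W_hr @ h:

```python
import numpy as np

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

# PyTorch splits the reset-gate bias into b_ir (input side) and b_hr (hidden side);
# both are simply added inside the sigmoid, so they collapse into one b_r.
b_ir = np.array([0.1, -0.3])
b_hr = np.array([0.2, 0.5])
pre  = np.array([1.0, -1.0])              # stand-in for W_ir @ x + W_hr @ h

r_split  = sigmoid(pre + b_ir + b_hr)     # PyTorch's two biases
r_merged = sigmoid(pre + (b_ir + b_hr))   # single b_r = b_ir + b_hr
print(np.allclose(r_split, r_merged))     # True
```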

What you can see from the calculations:

- when z_34 at index 9 is 0 (the *update* value), the new hidden state h_34 changes completely to the value of n_34 (-0.12) and does not carry anything from the previous hidden state h_33 at index 9 (0.19). So h_34 at index 9 becomes (-0.12) for the next step, a completely new hidden state value (a full copy of the candidate state);
- when z_34 at index 0 is close to 1 (0.94), the new hidden state h_34 retains the value of h_33 (0.84) and becomes (0.82) for the next step, almost unchanged (like in the vanilla RNN case);
- the candidate state calculations are more complex to explain:
  - first, you calculate r_34 by linearly transforming the input x_34 and linearly transforming the previous hidden state h_33 (with special weights for the *Reset* gate, both for the input and for the previous hidden state), summing those values, and applying the sigmoid to get r_34;
  - next, you calculate n_34 by linearly transforming the previous hidden state h_33 (with special weights for the previous hidden state when “in” the *Candidate* gate), *but* now multiplying these values by r_34 and only then summing with the linearly transformed input x_34 (again, special weights for the input “in” the *Candidate* gate);
  - lastly, you apply *tanh* so that the n_34 values range from -1 to 1, giving the *Candidate* hidden state. So some values of n_34, like at index 3, become (-0.99), and some, like at index 15, become (0.97). And **if** the *update* value is close to 0 (like in the latter case, at index 15), the new hidden state will be changed to be close to the candidate state (0.95).
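The three steps above can be sketched as one GRU step in NumPy. This uses toy random weights and omits the biases for brevity (it is not the trained model from the illustration, just the same structure):

```python
import numpy as np

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(1)
H, X = 4, 5                                    # toy hidden / vocab sizes
Wir, Wiz, Win = (rng.normal(size=(H, X)) for _ in range(3))  # input weights
Whr, Whz, Whn = (rng.normal(size=(H, H)) for _ in range(3))  # hidden weights

x_t    = np.eye(X)[2]                          # a one-hot input, as in the post
h_prev = rng.normal(size=H)                    # previous hidden state

r = sigmoid(Wir @ x_t + Whr @ h_prev)          # reset gate
z = sigmoid(Wiz @ x_t + Whz @ h_prev)          # update gate
n = np.tanh(Win @ x_t + r * (Whn @ h_prev))    # candidate state
h = (1.0 - z) * n + z * h_prev                 # new hidden state (PyTorch convention)
print(h)
```

Wherever z is near 1, h copies h_prev almost exactly; wherever z is near 0, h copies the candidate n, exactly as in the indexed examples above.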


In general terms:

- the *Reset* gate controls how much of the previous state to retain (for candidate state calculations);
- the *Update* gate controls the **balance** between the new candidate state and the old hidden state;
- the *Candidate* state is our way of expressing (calculating) what the new hidden state could be; the calculations involve the current input, the previous hidden state, and the *Reset* gate values.

Cheers