GRU and vanishing gradients

I don’t quite understand why GRU prevents vanishing gradients.
c^t = gamma * c_tilde^t + (1 - gamma) * c^(t-1)

The lecture says that c^t is given by the formula above, but when backpropagating through it we get
dc_tilde^t = gamma * dc^t
dc^(t-1) = (1 - gamma) * dc^t
and both gamma and (1 - gamma) are less than 1, so it seems quite possible that the gradient flowing through this step is smaller than the gradient that arrived at it.
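
For concreteness, here is a tiny sketch (plain Python, made-up values, not the lecture code) of the local gradients I mean:

```python
gamma = 0.3                      # some update-gate value in (0, 1), purely illustrative
dc_t = 1.0                       # pretend dL/dc^t arriving from the next time step

dc_tilde = gamma * dc_t          # gradient w.r.t. the candidate, scaled by gamma
dc_prev  = (1 - gamma) * dc_t    # gradient w.r.t. c^(t-1), scaled by (1 - gamma)
print(dc_tilde, dc_prev)         # 0.3 and 0.7 -- both factors are below 1
```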

However, with a GRU we can increase the contribution of context from long ago, so I could argue that even if we cannot prevent vanishing gradients entirely, we at least get larger gradients than we would without the GRU. If that is the right way to think about it, I can accept it. Is there any error in my understanding? I would like to know in more detail why the GRU can prevent vanishing gradients.

Hi @Shiori_YAMASHITA ,

Let me start from the end. You ask: “why do we want to prevent vanishing gradients?”

In general, we want to prevent vanishing gradients because, when gradients vanish, the NN basically stops learning.

In particular, in RNNs we also want to prevent vanishing gradients because we want the NN to remember long-range dependencies. How so? Well:

  1. RNNs are usually very long, let’s say 100 time steps (layers). Vanishing gradients in the earlier layers are very likely because backprop through such a long NN shrinks the error signal step by step until it is almost zero by the time it reaches the start.

  2. This vanishing gradient makes it hard for the RNN to maintain long-term dependencies. Some words near the end of a sentence depend on words near the beginning: for example, whether a verb at the end should be singular or plural depends on a noun at the beginning (like “the cat … was” or “the cats … were”), perhaps 90 steps earlier.

  3. As said, in these long networks the error signal computed during backprop from the end to the start may become almost zero, so the earlier layers are barely updated; hence the NN loses the ability to ‘remember’ whether the cat is singular or plural and to make the verb agree with it.

Since we want these long NNs to remember long-range dependencies, we want to avoid vanishing gradients.

Now to the first part of your question: how do GRUs prevent vanishing gradients?

Well, GRUs act like ‘memory cells’. In the case of the RNN reviewed in the lecture, these memory cells, called “ct”, are equal to the activations “at” (ct = at).

Let’s say one of these memory cells is ‘activated’ (set to 1) at the beginning of the network, e.g. the cat is singular, so let’s set the cell to 1.

The GRU will keep this cell ‘active’ for a long time (the value will survive across many, many layers), so it will reach the long-range layers of the NN.

This ability to maintain its value over the long term helps the cell avoid the vanishing gradient problem: the gradient that flows back along the (1 - gamma) path is multiplied by something very close to 1, so it is barely attenuated.

In summary, we want to avoid vanishing gradients in RNNs to be able to maintain long-term relationships among terms, and GRUs help do that because of their ability to keep their value across long networks.
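
If it helps, here is a rough numpy sketch (not the assignment code; the weights, the large negative gate bias, and the scalar cell are all made up for illustration) of the simplified GRU update, showing how a cell set to 1 early on can survive 100 steps when the update gate stays near zero:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

np.random.seed(0)
Wc, bc = np.random.randn(2) * 0.1, 0.0    # candidate-value parameters (made up)
Wu, bu = np.random.randn(2) * 0.1, -8.0   # large negative bias => gate stays near 0

c = 1.0                                   # cell 'activated' at t = 0 ("the cat is singular")
for t in range(100):                      # 100 time steps / layers
    x = np.random.randn()                 # stand-in for the next input
    concat = np.array([c, x])             # [c^(t-1), x^t]
    gamma = sigmoid(Wu @ concat + bu)     # update gate, ~ 0 because of bu
    c_tilde = np.tanh(Wc @ concat + bc)   # candidate replacement value
    c = gamma * c_tilde + (1 - gamma) * c # mostly keeps the old value

print(c)  # still close to 1 after 100 steps, because gamma stayed tiny
```

In a real GRU the gate values are learned, of course; the hard-coded bias here just stands in for a gate that the network has learned to keep closed.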

Please review these ideas and share any reaction.

Juan

Oh no, the first question you answered was caused by my typo, sorry for the inconvenience…
I meant to write “I would like to know more in detail about why GRU can prevent vanishing gradients,” but I mistyped it and caused a miscommunication. I am so sorry…

Thank you for answering my question!

Let’s say one of these memory cells is ‘activated’ (set to 1) at the beginning of the network

You said this, but in practice the proportion retained in long-term memory is less than 1 and never actually reaches 1 (because of the sigmoid function). So I think this algorithm cannot fully avoid vanishing gradients. What about this point?

Once a cell is activated, gamma will be very very close to zero.

Let’s look at the formula:

c^t = gamma * c_tilde^t + (1 - gamma) * c^(t-1)

This formula has 2 parts:
Part 1: gamma * c_tilde^t

Part 2: (1 - gamma) * c^(t-1)

What happens if gamma is almost zero?

Part 1: gamma * c_tilde^t ≈ 0 * c_tilde^t = 0

Part 2: (1 - gamma) * c^(t-1) ≈ (1 - 0) * c^(t-1) = c^(t-1)

So when gamma is almost zero, the cell tends to maintain its value.
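
Plugging in some purely illustrative numbers makes this concrete:

```python
gamma   = 1e-4          # update gate almost zero (illustrative value)
c_prev  = 1.0           # cell set earlier in the sequence ("the cat is singular")
c_tilde = -0.7          # whatever the new candidate happens to be

part1 = gamma * c_tilde          # ~ -0.00007, contributes almost nothing
part2 = (1 - gamma) * c_prev     # ~  0.9999,  essentially c_prev itself
print(part1 + part2)             # ~  0.9999 -> the cell keeps its value
```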

Why is gamma so close to zero?

Let’s look at the formula for gamma:

gamma = sigmoid(Wu[c^(t-1), x^t] + bu)

And, as explained by Dr Ng, the expression Wu[c^(t-1), x^t] + bu can easily end up being a large negative number.

If we apply the sigmoid to a large negative number, we get a value very close to zero.
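
You can check this quickly (a tiny numpy snippet, input values chosen just for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(-5.0))    # ~ 6.7e-03
print(sigmoid(-10.0))   # ~ 4.5e-05
print(sigmoid(-20.0))   # ~ 2.1e-09, effectively zero
```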

So you mean that, depending on the learned parameters, gamma can get arbitrarily close to 0, and therefore the gradient can be maintained?

I understand, thank you!

Exactly! Depending on the learned parameters, gamma can be very close to zero. When gamma is very close to zero, the cell c^t becomes essentially equal to the previous cell c^(t-1), hence it maintains its value across the layers.

I am very glad that this sheds light on the question!

Good luck with the rest of the course!

Juan
