Hi @Shiori_YAMASHITA ,
Let me start from the end. You ask: “why do you want to prevent vanishing gradients?”
In general, we want to prevent vanishing gradients because, when gradients vanish, the weight updates become negligible and the NN basically stops learning.
In RNNs in particular, we also want to prevent vanishing gradients because we want the network to remember long-range dependencies. How so? Well:
- Usually RNNs are very long; let's say 100 layers (time steps). Vanishing gradients in the earlier layers are very likely, because backpropagation through such a long network multiplies many small factors step by step, so the gradient can shrink to almost zero by the time it reaches the start (see the sketch after this list).
- This vanishing gradient makes it hard for the RNN to maintain long-term dependencies, where a word near the end depends on a word near the beginning. For example, whether a verb at the end should be singular or plural can depend on a noun that appeared, say, 90 steps earlier (“the cat … was” vs. “the cats … were”).
- As said, in these long networks the gradient computed during backprop from the end to the start may become almost zero, so the earlier layers are barely updated; hence the network loses the ability to ‘remember’ whether the cat is singular or plural, and whether the verb should match.
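Here is a minimal sketch of that shrinkage (this is not code from the course; the 1-unit vanilla RNN, the tanh activation, and the recurrent weight w_aa = 0.5 are just illustrative assumptions):

```python
import numpy as np

# Toy illustration: how the gradient that backpropagates through many RNN
# time steps can shrink toward zero.
np.random.seed(0)

T = 100          # number of time steps (the "100 layers" unrolled in time)
w_aa = 0.5       # recurrent weight; |w_aa| < 1 makes the shrinkage obvious
a = 0.0          # hidden activation
grad = 1.0       # gradient of the loss w.r.t. the activation at the last step

pre_acts = []
for t in range(T):                 # forward pass, storing pre-activations
    x_t = np.random.randn()
    z = w_aa * a + x_t
    pre_acts.append(z)
    a = np.tanh(z)

for t in reversed(range(T)):       # backprop through time
    # chain rule: d a_t / d a_{t-1} = w_aa * tanh'(z_t)
    grad *= w_aa * (1.0 - np.tanh(pre_acts[t]) ** 2)
    if t % 20 == 0:
        print(f"gradient reaching step {t:3d}: {abs(grad):.3e}")
# The printed magnitudes drop by many orders of magnitude, so the earliest
# steps receive essentially no learning signal.
```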
Since we want these long networks to remember such long-range dependencies, we want to avoid vanishing gradients.
Now to the first part of your question: how do GRUs prevent vanishing gradients?
Well, GRUs are built around ‘memory cells’. In the GRU presented in the lecture, the memory cell, called “ct”, is equal to the activation “at” (ct = at).
Let's say one of these memory cells is ‘activated’ (set to 1) near the beginning of the sequence, for example to encode that the cat is singular.
The GRU will keep this cell ‘active’ for a long time (its value can survive across many, many time steps), so the information still reaches the layers far down the sequence.
This ability to maintain its value over the long term is also what keeps the gradient flowing back through the cell from vanishing.
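Here is a minimal sketch of that idea, using the simplified GRU update from the lecture, ct = Γu * c̃t + (1 − Γu) * c(t−1). The scalar cell, the hand-picked gate logit around −8, and the candidate function below are toy assumptions chosen only to force the update gate close to 0:

```python
import numpy as np

# Toy illustration: when the update gate gamma_u stays close to 0, the GRU
# cell just copies its previous value, so "the cat is singular" (c = 1)
# survives ~90 steps, and the derivative along that copy path stays near 1.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

np.random.seed(0)
T = 90
c = 1.0               # memory cell set to 1 early on ("the subject is singular")
copy_grad = 1.0       # product of the (1 - gamma_u) factors along the way

for t in range(T):
    x_t = np.random.randn()
    gate_logit = -8.0 + 0.1 * x_t     # hand-picked so the gate stays near 0
    gamma_u = sigmoid(gate_logit)     # update gate, close to 0
    c_tilde = np.tanh(0.5 * c + x_t)  # candidate value (would overwrite c)
    c = gamma_u * c_tilde + (1.0 - gamma_u) * c   # simplified GRU update
    copy_grad *= (1.0 - gamma_u)      # the "copy" path through the cell

print(f"cell value after {T} steps: {c:.4f}")                   # still close to 1
print(f"gradient factor along the copy path: {copy_grad:.4f}")  # close to 1, not ~0
```

When the network does need to overwrite the memory (for example, a new subject appears), the gate moves toward 1 and the cell takes the new candidate value instead.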
In summary, we want to avoid vanishing gradients in RNNs so the network can maintain long-term relationships among words, and GRUs help with that because of their ability to keep a cell's value across many time steps.
Please review these ideas and share any reactions.
Juan