I might be missing something obvious. I understand
the intuition – that Gamma_u near 0 means the memory cell can be
retained across many time steps – but I am having trouble seeing
this mathematically.
Concretely: if Gamma_u stays close to 0 over most time steps,
why does taking the gradient backward through many
layers of time not produce a vanishing quantity?
Andrew touches on this at roughly 11:04 in the week 1
GRU video. He says the following:
"because gamma can be so close to zero, can be 0.000001 or even smaller than that. it doesn’t suffer from much of a vanishing gradient problem because in say gamma so close to zero that this becomes essentially C^t equals C^t minus one and the value of C^t is maintained pretty much exactly even across many times that. This can help significantly with the vanishing gradient problem and therefore allowing your network to learn even very long-range dependencies, such as the cat and was are related even if they are separated by a lot of words in the middle. "
Hi @A112, what Prof. Ng is trying to explain is that if Gamma_u is close to zero,
then Ct = Gamma_u * C~t + (1 - Gamma_u) * Ct-1 = 0 * C~t + (1 - 0) * Ct-1 = Ct-1.
So Ct = Ct-1, and as long as Gamma_u stays close to zero, the value of Ct will be remembered.
See also the snippet from the lecture slide on this.
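To make this concrete, here is a minimal numeric sketch of that update (plain NumPy; the scalar setup and variable names are just for illustration, not from the course code):

```python
import numpy as np

# GRU memory-cell update, scalar case for clarity:
#   Ct = Gamma_u * C~t + (1 - Gamma_u) * Ct-1
gamma_u = 1e-6   # update gate stuck near 0
c = 1.0          # initial memory value C_0

for t in range(1000):
    c_tilde = np.random.randn()                 # candidate value at step t
    c = gamma_u * c_tilde + (1 - gamma_u) * c   # gated update

print(c)  # still ~1.0: the memory survives 1000 steps almost unchanged
```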
@sjfischer –
That does make sense, and thank you for this comment.
The main thing I still cannot see, however, is why the fact you just pointed out implies that the gradient (taken back through many time steps) will not vanish. Do you have thoughts on this?
Thanks again for any insight you or anyone else might have on this!
Hi @A112, good question. I am not sure I can give a better explanation than the one in the GRU video, but here is the idea. Basically, c is the memory cell and Gamma_u is the gate at each time step. At each step, the gate decides whether to keep the old value of the memory or use the newly calculated candidate for that step. The newly predicted value (C~t above) can run to zero, but if the gate decides that the old memory value is important and has to be kept (Gamma_u = 0), then even as the candidate goes to zero, the new memory cell will hold the same value as the previous one. Hope that helps.
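To tie this back to the gradient question: along that copy path, the local derivative of Ct with respect to Ct-1 is (1 - Gamma_u) (treating the gate itself as a constant, which is a simplification for illustration). Backprop through T steps multiplies T such factors together, so the gradient only vanishes when those factors are well below 1. A minimal sketch of the arithmetic:

```python
# Along the copy path Ct = Gamma_u*C~t + (1 - Gamma_u)*Ct-1, each step
# contributes a local factor dCt/dCt-1 = (1 - Gamma_u) to the backpropagated
# gradient (gate treated as constant, a simplification for illustration).
T = 500

for gamma_u in (1e-6, 0.5):
    grad = (1.0 - gamma_u) ** T   # product of T identical local factors
    print(f"Gamma_u = {gamma_u}: dC_T/dC_0 ~ {grad:.3g}")

# Gamma_u near 0 -> product ~ 0.9995  (gradient survives 500 steps)
# Gamma_u = 0.5  -> product ~ 3e-151  (gradient vanishes)
```

So with Gamma_u near zero the product stays near 1, which is exactly why the long-range gradient does not die out.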