Course 5, Week 1: Why does the GRU update gate being close to 0 mean we do not have a vanishing gradient problem?

Hello everyone!

I might be missing something obvious. While I do understand
the intuition, namely that Gamma_u near 0 means we can retain
the memory cell across many time steps, I am having trouble seeing
this mathematically.

Concretely, why does the fact that Gamma_u stays close to 0 over
most time steps mean that taking the gradient backward over many
time steps will not produce a vanishing quantity?

Andrew touches on this at roughly 11:04 in the Week 1
GRU video. He says the following:

"because gamma can be so close to zero, can be 0.000001 or even smaller than that. it doesn’t suffer from much of a vanishing gradient problem because in say gamma so close to zero that this becomes essentially C^t equals C^t minus one and the value of C^t is maintained pretty much exactly even across many times that. This can help significantly with the vanishing gradient problem and therefore allowing your network to learn even very long-range dependencies, such as the cat and was are related even if they are separated by a lot of words in the middle. "

Any intuition would be much appreciated!!

Hi @A112, what Prof. Ng is trying to explain is that if Gamma_u is close to zero,
then Ct = Gamma_u * c_tilde_t + (1 - Gamma_u) * Ct-1 = 0 * c_tilde_t + (1 - 0) * Ct-1 = Ct-1, where c_tilde_t is the candidate value at time step t.

So Ct = Ct-1, and as long as Gamma_u stays close to zero, the value of Ct will be remembered.
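
To make that concrete, here is a minimal numerical sketch of the update above (not course code; the constant gate value, initial memory value, and candidates are made-up assumptions). With Gamma_u near zero, the memory cell barely changes even after many time steps:

```python
import numpy as np

# Simplified GRU memory update: c_t = gamma_u * c_tilde_t + (1 - gamma_u) * c_{t-1}
# Hypothetical values: a tiny update gate and an arbitrary candidate at each step.
gamma_u = 1e-6          # update gate stuck near 0
c = 5.0                 # initial memory cell value c_0 (arbitrary)
T = 1000                # number of time steps

for t in range(T):
    c_tilde = np.random.randn()              # some new candidate value at step t
    c = gamma_u * c_tilde + (1 - gamma_u) * c

print(c)  # still ~5.0: the memory cell is carried through essentially unchanged
```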

See also the snippet from the lecture slide on this.

Hope this clarifies.

[snippet from the lecture slide]


@sjfischer
That does make sense, and thank you for this comment.

The main thing I cannot see, however, is why the fact you just pointed out implies that the gradient (taken back over many previous time steps) will not vanish. Do you have any thoughts on this?

Thanks again for any insight you or anyone else might have on this!

Hi @A112, good question. I am not sure whether I can give a better explanation than the one in the video on GRUs. Basically, c is the memory cell and gamma is the gate at each time step. At each step, the gate decides whether to keep the old value of the memory or use the newly calculated candidate at that step. The newly predicted candidate (c_tilde_t in the above) can go to zero, but if the gate (gamma) decides that the old memory value is important and has to be kept (gamma = 0), then even as the candidate goes to zero, the new memory cell will have the same value as the previous one. Hope that helps.
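
As a small illustration of the gate acting as a selector (the gate pattern and values below are hand-picked for illustration, not from the course): when gamma is 0 the old memory passes through untouched, and when gamma is 1 it is overwritten by the candidate.

```python
# The update gate as a per-step selector between old memory and new candidate.
# Hypothetical gate values: 0 = keep old memory, 1 = overwrite with candidate.
c = 5.0
steps = [
    # (gamma_u, c_tilde)
    (0.0, -3.1),   # keep old memory
    (0.0,  0.7),   # keep old memory
    (0.0,  0.0),   # candidate is zero, but the memory is still kept
    (1.0,  2.0),   # gate opens: memory is overwritten by the candidate
]

for gamma_u, c_tilde in steps:
    c = gamma_u * c_tilde + (1 - gamma_u) * c
    print(gamma_u, c)
# c stays 5.0 for the first three steps, then becomes 2.0
```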


@sjfischer thank you for your thoughtful answer. I appreciate your help!

Note that equality, i.e. the identity function, has derivative 1. Each backward step through Ct = Ct-1 therefore multiplies the gradient by 1, so it neither explodes nor vanishes, even across long-range dependencies.
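
A quick back-of-the-envelope sketch of that point (my own check, not from the course): if we treat the candidate c_tilde_t as independent of Ct-1, the per-step factor in the chain rule is dCt/dCt-1 = 1 - Gamma_u, so the gradient of C_T with respect to C_0 is (1 - Gamma_u)^T, which stays near 1 when Gamma_u is near 0. Compare that with a hypothetical per-step factor of 0.5 in a plain RNN, which collapses to essentially zero over the same horizon.

```python
# Gradient of c_T with respect to c_0 under the simplified update
# c_t = gamma_u * c_tilde_t + (1 - gamma_u) * c_{t-1},
# assuming the candidate c_tilde_t does not depend on c_{t-1}.
T = 1000

gamma_u = 1e-6
gru_like_grad = (1 - gamma_u) ** T      # product of per-step factors, each ~1
print(gru_like_grad)                    # ~0.999: the gradient is preserved

shrink_factor = 0.5                     # hypothetical per-step factor in a plain RNN
plain_rnn_grad = shrink_factor ** T
print(plain_rnn_grad)                   # essentially zero: the gradient has vanished
```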