In GRUs, a&lt;t&gt; = c&lt;t&gt;, so if c is kept preserved over multiple further RNN blocks, isn’t a direct consequence that a will also be preserved? Isn’t that a big problem?
For example, consider a word predictor on the sentence Andrew uses:
“The cats, which already …, were full.”
As the GRU proceeds with its predictions and reaches “cats”, it decides it must preserve this activation/c. So how does it go on with other predictions like “which” and “already”, when for both of them the incoming activation is the same?
EDIT: I rewatched the video, and what I have understood is that since c is a vector, you can have, say, just one value representing singular/plural saved while the other activations change. Is this understanding correct, and is this what we observe in practice as well?
Hey,
my understanding is that in a regular RNN, the activation at timestep t should theoretically carry all the info about the previous timesteps t-1, t-2, …, 0.
But in practice the issue is that when t is big (in a long sentence, for example), the info in the very first activations can hardly propagate all the way through.
In the GRU, however, the added term (1 - Gamma_u) .* c&lt;t-1&gt; (or equivalently (1 - Gamma_u) .* a&lt;t-1&gt;) helps the “important” info from previous timesteps propagate all the way through.
Gamma_u is what makes the activations different at each timestep, and it depends on x&lt;t&gt;. I feel it should be written Gamma_u&lt;t&gt; to make it less confusing.
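Here’s a minimal NumPy sketch of that simplified GRU step from the lecture (my own placeholder parameter names; the relevance gate Gamma_r is left out):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def simplified_gru_step(c_prev, x_t, W_u, b_u, W_c, b_c):
    """One step of a simplified GRU (no relevance gate); parameters are placeholders."""
    concat = np.concatenate([c_prev, x_t])              # [c<t-1>, x<t>]
    gamma_u = sigmoid(W_u @ concat + b_u)               # Gamma_u<t>, one value per cell unit
    c_tilde = np.tanh(W_c @ concat + b_c)               # candidate memory value c~<t>
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev  # element-wise blend with c<t-1>
    a_t = c_t                                           # in a GRU, a<t> = c<t>
    return a_t, c_t
```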
Hey, thanks for your reply.
Is that so? As in, will Gamma_u really be different at each timestep? The reason I am confused is that Andrew writes that for the plural activation to be saved, Gamma_u has to stay continuously near 0, as he wrote in the bottom left. Or was that an extreme case?
To be sure, I just rewatched the video; at 12:45 he says that in practice Gamma_u won’t be exactly zero or one.
I think the case where it is very close to 0 is a special case.
That means the word at t2 depends only on the word at t1 (supposing Gamma_u = 0 for every t between t1 and t2; see the small sketch at the end of this reply).
In the example from the video, “the cats … were full”: if we say x&lt;t2&gt; = “were” and x&lt;t1&gt; = “cats”, this might be true in some sense, since “were” depends only on the state carried from “cats”.
But for words that depend on many other words, c should keep many steps in memory, and so Gamma_u won’t be exactly zero or one.
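Just to make that concrete, here is a throwaway NumPy sketch (values made up): if the gate stays at exactly 0 at every step between t1 and t2, the memory cell is copied unchanged, so c&lt;t2&gt; equals c&lt;t1&gt;.

```python
import numpy as np

c = np.array([0.9, -0.3, 0.5])      # c<t1>
gamma_u = np.zeros_like(c)          # extreme case: the update gate is 0 at every step

for t in range(5):                  # a few steps between t1 and t2
    c_tilde = np.random.randn(3)    # whatever candidate the GRU proposes each step
    c = gamma_u * c_tilde + (1 - gamma_u) * c   # the GRU update rule

print(c)  # still [ 0.9 -0.3  0.5 ]: c<t2> == c<t1> when Gamma_u stays at 0
```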
Hey, so you’re right: Gamma_u can take any value between 0 and 1; Andrew asks us to treat it as exactly 0 or 1 only for intuition. So one can literally imagine the plural activation being carried all the way through.
The explanation I have landed on is that Gamma_u isn’t a single number but a vector the same size as the activation, i.e. you have a gate for each element of the activation, controlling which elements to preserve and by what factor.
So Gamma_u provides a path to completely save one particular element of the activation from the previous timestep and keep saving it, by keeping the corresponding gate element at 0.
The confusion I had was that I thought the whole activation would be exactly the same as the previous one in the 0-or-1 case. But this is resolved by understanding that all the activations and gates are vectors. So the network can learn to save only one particular value while changing all the others by any degree.
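Here’s a tiny NumPy illustration of that point (the numbers, and the choice of index 0 as the “plural” unit, are made up): with a vector gate, one element of c can be held nearly fixed while the others are overwritten.

```python
import numpy as np

c_prev  = np.array([0.9, -0.3,  0.5])   # pretend index 0 is the unit tracking singular/plural
c_tilde = np.array([0.1,  0.7, -0.8])   # candidate values at this timestep

# Hypothetical update gate: ~0 for the unit we want to keep, ~1 for units we overwrite.
gamma_u = np.array([0.01, 0.95, 0.99])

c_t = gamma_u * c_tilde + (1 - gamma_u) * c_prev
print(c_t)  # [ 0.892  0.65  -0.787]: index 0 stays close to 0.9, the rest mostly follow c_tilde
```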
I’d appreciate if a mentor could close this up for us. @paulinpaloalto @TMosh @kenb
Hi, it’s me again,
The thing is, if Gamma_u is 0 (namely the zero vector), then the activation is just the same as at the previous timestep. So for anything to change, Gamma_u must not be the zero vector. But I agree it is possible that some of the elements of Gamma_u are close to zero.
However, since it acts on the activations, it’s hard to say which component tracks what, because I think all the info (tracked by the activations) is mixed and distributed across all the elements.
But yeah, I’d like to hear other people’s thoughts as well.
Yes, exactly, you’re right. That’s why I wanted to know whether in practice some elements go super close to zero or one, or whether everything is usually grey.