I got the version of the question where Alice and Betty propose simplified versions of the GRU, but the answers are pretty confusing to me. Why are the choices all talking about the other person’s modification? For example, the two choices that pick Alice’s model (setting gamma_u = 0) go on, in their second half, to describe the consequences of gamma_r being either 0 or 1.

Hey @abelian_group_chen,

The options aren’t referring to the other person’s modification. I can see how this could be confusing, so let’s clear it up right away. Consider the first option:

Alice’s model (removing \Gamma_u), because if \Gamma_r \approx 0 for a timestep, the gradient can propagate back through that timestep without much decay.

As per the question, Alice’s model fixes \Gamma_u = 0, so you can’t change it in any way. Hence, this option asks what happens when the other variable, \Gamma_r, takes on different values, for instance 0 in this particular option.

So, every option has one of the two variables fixed as defined in the question, and tests your understanding of what happens when the other variable is varied. Will the resulting pair of values satisfy the conditions mentioned in the question or not? I hope this helps.

Regards,

Elemento

Thanks Elemento, I am still a bit confused about the choices. We know that gamma_u needs to be zero so that c_t and c_(t-1) are highly correlated, but two of the choices offer this in slightly different ways. Choosing Alice’s model will have gamma_u = 0 for sure, but Betty’s model might also benefit when gamma_u is approximately 0. I chose the first option (Alice’s model) and was marked wrong.

Hey @abelian_group_chen,

Please check your DM.

Regards,

Elemento

Hi @Elemento, I have a doubt. If we set gamma_u = 0, then it doesn’t matter what gamma_r is, because gamma_u is always 0, which implies that c_t always equals c_(t-1). Hence, according to me, Alice’s model (with gamma_u = 0, where it doesn’t matter whether gamma_r is 0 or 1) should be the answer, since gamma_u = 0 means c_tilde_t has no effect, and hence gamma_r has no effect either. Let me know where I’m going wrong.
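To make the arithmetic in this post concrete, here is a minimal sketch of one scalar GRU step with both gates passed in as fixed numbers. The function name `gru_step` and the weights `w_c`, `w_x`, `b_c` are hypothetical values chosen for illustration, and the update rule is assumed to be the one from the course lectures. It only checks the claim that c_t equals c_(t-1) when gamma_u = 0; it says nothing about which quiz option is correct.

```python
import math


def gru_step(c_prev, x, gamma_u, gamma_r, w_c=0.5, w_x=0.5, b_c=0.0):
    """One scalar GRU step with both gates supplied as fixed numbers.

    The weights are arbitrary illustrative values, not learned parameters.
    """
    # Candidate value: the relevance gate scales how much of c_prev is used.
    c_tilde = math.tanh(w_c * gamma_r * c_prev + w_x * x + b_c)
    # Update: blend the candidate with the previous memory cell.
    return gamma_u * c_tilde + (1.0 - gamma_u) * c_prev


c_prev, x = 0.8, -1.3
for gamma_r in (0.0, 0.5, 1.0):
    c_t = gru_step(c_prev, x, gamma_u=0.0, gamma_r=gamma_r)
    print(f"gamma_r={gamma_r}: c_t={c_t}")
```

With gamma_u fixed at 0, the printed c_t is the same as c_prev for every gamma_r, because the candidate c_tilde is multiplied by zero. Changing `gamma_u` to a nonzero value makes the result depend on `gamma_r` again.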

Hey @iamcalledayush,

Welcome, and we are glad that you could become a part of our community.

Please check your DM.

Cheers,

Elemento

Could someone explain this to me as well?

Isn’t the whole point of setting gamma_u = 0 to allow the gradient to backpropagate more easily? Isn’t the (1 - gamma_u) * c^{t-1} term essentially the part that prevents vanishing gradients?
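For reference, the term being asked about can be written out. This is a sketch for the scalar case, assuming the simplified GRU update from the lectures and treating \Gamma_u as a constant (in a real GRU the gate itself depends on c^{t-1}, which adds a further term):

```latex
c^{\langle t \rangle} = \Gamma_u \, \tilde{c}^{\langle t \rangle} + (1 - \Gamma_u) \, c^{\langle t-1 \rangle}

\frac{\partial c^{\langle t \rangle}}{\partial c^{\langle t-1 \rangle}}
  = \Gamma_u \, \frac{\partial \tilde{c}^{\langle t \rangle}}{\partial c^{\langle t-1 \rangle}}
  + (1 - \Gamma_u)
```

When \Gamma_u is close to 0, this derivative is close to 1, so under these assumptions that time step neither shrinks nor grows the gradient passing through it.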

That is one benefit, but the main purpose is to allow the gate to fire at some later time step, to help give context to different parts of the input.

That’s what Andrew means in the “GRU (simplified)” lecture at 11:40 when he talks about learning the dependencies.

I have come to the same conclusion, and of course it just got marked as “incorrect” :-(.

Is there any explanation of what the question really asks for and why this is wrong? After all, the question only asks for when it will “work without vanishing gradient problems”.