True/False: In order to simplify the GRU without vanishing gradient problems, even when training on very long sequences, you should always remove Γu, i.e., set Γu = 0.
Is this ever actually done in practice? It seems a bit extreme, as it would force the dependence to be only on the first element and the current element of the sequence at each timestep. What if the key information to keep was the second word? We would lose this by hard-forcing Γu = 0 from the beginning.
Another comment: in the lectures, the softmax to get ŷ is applied after the update gate, which, if Γu = 0 were forced, would mean that every output at every timestep would only ever depend on the first input, which really does seem odd. Should the softmax used to obtain the output at each timestep be applied to c̃ instead?
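To make my confusion concrete, this is the simplified GRU as I understood it from the lecture, written out in the course notation. It is my own sketch, and the output line with W_y and b_y is my assumption about how ŷ is computed, not something taken from the quiz:

```latex
\begin{align}
\tilde{c}^{\langle t \rangle} &= \tanh\big(W_c\,[c^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_c\big) \\
\Gamma_u &= \sigma\big(W_u\,[c^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_u\big) \\
c^{\langle t \rangle} &= \Gamma_u \odot \tilde{c}^{\langle t \rangle} + (1 - \Gamma_u) \odot c^{\langle t-1 \rangle} \\
\hat{y}^{\langle t \rangle} &= \mathrm{softmax}\big(W_y\, c^{\langle t \rangle} + b_y\big)
\end{align}
% Hard-forcing Gamma_u = 0 at every step collapses the update to
% c^{<t>} = c^{<t-1>} = ... = c^{<0>}, so no input ever enters the memory
% cell and every output depends only on how the cell was initialized.
```

If that reading is right, it is what makes forcing Γu = 0 from the start seem so odd to me.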
In the lectures, the “simplified” version is shown to help you understand the concept, and I believe it is mentioned that it is not really used in practice and that there are many variations of the GRU. Towards the end, Andrew talks about the one that has gamma_r in it, the version that has become widely used after a lot of research (and then again, Andrew also mentions that you can come up with one of your own).
With that being said, the quiz question is about how to simplify the GRU, which, for the version Andrew has shown, is done by setting gamma_u = 0.
Your critique here is right: is this used in practice? Maybe not. But the quiz question is not about whether it can be used or not; it is about how to make the GRU simpler.
Hope I have answered your query.
Best,
Mubsi
P.S. Since you gave away the answer to the question in your post, I have removed it.
The correct answer I get is removing gamma_r (setting gamma_r = 1). I am very confused by this question; are you able to help explain why the other answers are not correct? Thanks.
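For reference, this is the full GRU with the relevance gate gamma_r as I remember it from the lecture, again written out in the course notation as my own sketch rather than the quiz's wording:

```latex
\begin{align}
\Gamma_r &= \sigma\big(W_r\,[c^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_r\big) \\
\tilde{c}^{\langle t \rangle} &= \tanh\big(W_c\,[\Gamma_r \odot c^{\langle t-1 \rangle},\; x^{\langle t \rangle}] + b_c\big) \\
\Gamma_u &= \sigma\big(W_u\,[c^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_u\big) \\
c^{\langle t \rangle} &= \Gamma_u \odot \tilde{c}^{\langle t \rangle} + (1 - \Gamma_u) \odot c^{\langle t-1 \rangle}
\end{align}
% With Gamma_r fixed to 1, the relevance gate becomes a no-op and the candidate
% reduces to the simplified form from the lecture, while the Gamma_u update
% path (and its role in carrying information over long sequences) is untouched.
```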