I am confused about the gates in the GRU model. If \Gamma_u = 0, then c<t> = c<t-1>, and since a<t> = c<t>, we have a<t> = a<t-1>. This seems strange to me.
How are we capturing long-term dependencies when the net effect is simply that the value of a<t> doesn't change over time?
You're right that if \Gamma_u is always 0, then things are not very interesting. But the point is that we are training the network, and both the \Gamma_u and \Gamma_r values are controlled by learned parameters (weight and bias values), as given in the formulas Prof Ng shows in the lectures. If training produces something that uninteresting, then something is wrong with our approach: either our training data is not expressive enough, or we've picked the wrong architecture for the GRU network (bad hyperparameter choices), or both.
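To make that concrete, here's a minimal NumPy sketch of one GRU step using the formulas from the lectures (the parameter names Wu, bu, Wr, br, Wc, bc are just my own placeholders, not course-provided code). The key point is that \Gamma_u is a sigmoid of learned weights applied to [c<t-1>, x<t>], so it is computed element-wise and can change at every time step: some units can keep \Gamma_u near 0 for many steps (preserving long-term information) while other units open up toward 1 when the current input says it's time to update.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x_t, Wu, bu, Wr, br, Wc, bc):
    """One full-GRU step following the lecture formulas.

    Shapes (illustrative): c_prev is (n_c,), x_t is (n_x,),
    each W is (n_c, n_c + n_x), each b is (n_c,).
    """
    concat = np.concatenate([c_prev, x_t])

    gamma_u = sigmoid(Wu @ concat + bu)   # update gate, values in (0, 1)
    gamma_r = sigmoid(Wr @ concat + br)   # relevance gate

    # Candidate memory cell, using the relevance-gated previous state
    c_tilde = np.tanh(Wc @ np.concatenate([gamma_r * c_prev, x_t]) + bc)

    # Element-wise blend: units where gamma_u is near 0 keep their old value
    # (long-term memory); units where gamma_u is near 1 adopt the candidate.
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev
    a_t = c_t                             # in the GRU, a<t> = c<t>
    return a_t, c_t
```

So \Gamma_u = 0 everywhere, forever, is just one degenerate end of the spectrum. What training actually learns is how much each unit should remember versus overwrite at each time step, and that per-unit, per-step control is exactly what lets the network carry information across long ranges without the state being trivially constant.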