I am a bit confused about how the GRU works. Take the example from the lecture, "The cat, which already ate …, was full." Since "cat" and "was" are closely related, my gate value will be 1 at "cat" and "was" and 0 for all the words in between. Does that mean my c value stays the same for the words in between? And if the c value is the same, then my a value for all the words between "cat" and "was" would also be the same, which clearly isn't how it works. What is the problem with my understanding?
It’s not that all of c^{<t>} stays the same: remember that the state and gate values are vectors with many components ("bits"). Prof Ng uses the example of a 100 x 1 vector, but that size is a hyperparameter. So different bits learn to track different things that have happened, are happening, or need to happen in the future. We don’t actually know or specify what those functions are or which bits will learn them, but conceptually a bit can encode a state like "we have seen the subject of the sentence and it was plural". Training and backprop figure out what works based on your training data set. The reason GRU and LSTM are more powerful than the "plain vanilla" RNN is that the gates give a more explicit mechanism for creating complex state that spans the entire length of the input. That makes it easier for training to learn the patterns needed for language, music, or whatever the particular application is.
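To make the per-bit gating concrete: the key is that the update c^{<t>} = Γ_u * c̃^{<t>} + (1 - Γ_u) * c^{<t-1>} is applied element-wise, so some components of the memory cell get overwritten while others are carried forward unchanged. Here is a minimal numpy sketch of one step of the simplified GRU from the lecture (the tiny state size and random weights are just for illustration, not trained values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_c, n_x = 4, 3                    # tiny sizes for readability; the lecture uses 100 for n_c

Wu = rng.standard_normal((n_c, n_c + n_x))   # update-gate weights
Wc = rng.standard_normal((n_c, n_c + n_x))   # candidate-memory weights
bu = np.zeros(n_c)
bc = np.zeros(n_c)

c_prev = rng.standard_normal(n_c)  # c^{<t-1>}, the memory cell from the previous step
x_t = rng.standard_normal(n_x)     # x^{<t>}, the current word's embedding

concat = np.concatenate([c_prev, x_t])
gamma_u = sigmoid(Wu @ concat + bu)   # Gamma_u: one gate value PER component, each in (0, 1)
c_tilde = np.tanh(Wc @ concat + bc)   # candidate replacement memory

# Element-wise blend: components where gamma_u is near 0 keep their old value
# (e.g. the bit remembering "subject was singular"), while components where
# gamma_u is near 1 are overwritten with the candidate.
c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev

print("gamma_u:", np.round(gamma_u, 2))
print("c_prev :", np.round(c_prev, 2))
print("c_t    :", np.round(c_t, 2))
```

If you print these out, you will see that each component of c^{<t>} moves by a different amount, so even while the "subject number" bit holds steady across "which already ate …", the other bits keep changing with each new word, and therefore a^{<t>} changes too.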