Doubt about the GRU architecture

I have a doubt about the architecture of the GRU. The full GRU cell uses two gates: a reset gate and an update gate. However, I am not able to understand the intuition behind using two gates.

The reset gate controls how much of the previous hidden state should be forgotten. So the reset gate already balances the current input against the previous hidden state, and the candidate it produces, c_tilde, is a blend of both.

But this c_tilde is then balanced against the previous hidden state a second time, using the update gate. Why do we need to rebalance c_tilde with the update gate to obtain c? Won't the reset gate alone do the job of combining the previous context with the current input?

The equations are here for reference:
(Screenshot of the GRU equations from the lecture slide.)
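In case the screenshot does not load, these are presumably the standard full-GRU equations in the course's notation, with $\Gamma_r$ the reset gate and $\Gamma_u$ the update gate; the exact symbols on the slide may differ slightly:

$$
\begin{aligned}
\Gamma_r &= \sigma\!\left(W_r\left[c^{\langle t-1\rangle}, x^{\langle t\rangle}\right] + b_r\right) \\
\Gamma_u &= \sigma\!\left(W_u\left[c^{\langle t-1\rangle}, x^{\langle t\rangle}\right] + b_u\right) \\
\tilde{c}^{\langle t\rangle} &= \tanh\!\left(W_c\left[\Gamma_r * c^{\langle t-1\rangle}, x^{\langle t\rangle}\right] + b_c\right) \\
c^{\langle t\rangle} &= \Gamma_u * \tilde{c}^{\langle t\rangle} + \left(1-\Gamma_u\right) * c^{\langle t-1\rangle}
\end{aligned}
$$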

If your question is from a specific course, please post it in the forum area for that course.

You can move your thread by using the “pencil” icon in the thread title. All of the course forums are in the “Course Q&A” area.


It’s been several years since I watched the lectures in this course, so I forget exactly what Prof Ng says when he explains the GRU architecture. But here’s my take on your question: think of it as two different “knobs” that can be turned to affect how the cell processes each timestep.

The reset gate controls the persistence of past occurrences in the sequence, e.g. whether the subject of the sentence was singular or plural. How long should that be remembered?

The update gate controls how much effect the remembered state and the new candidate each have on what happens at the current timestep. Maybe the word we are handling is not a verb, so it’s not affected by whether the subject was singular or plural. Just because we remember some past attribute doesn’t mean it affects the interpretation of every following word in the sentence: some words it affects and some it may not.

Having those two separate properties that the network can learn may help it be more “expressive”. Of course, even a plain vanilla RNN could in principle learn the same thing with just the simple hidden state, but making those “knobs” explicit in the architecture makes it easier for training to reach the level of performance we seek.
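To make the two “knobs” concrete, here is a minimal NumPy sketch of a single GRU step, following the equations above. This is an illustrative toy, not code from the course; the function name, parameter layout, and shapes are my own assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x, params):
    """One GRU timestep. c_prev has shape (n_c,), x has shape (n_x,)."""
    Wr, br, Wu, bu, Wc, bc = params  # each W has shape (n_c, n_c + n_x)
    concat = np.concatenate([c_prev, x])

    # Reset gate: how much of the previous state feeds the candidate.
    gamma_r = sigmoid(Wr @ concat + br)
    # Update gate: how much the candidate replaces the previous state.
    gamma_u = sigmoid(Wu @ concat + bu)

    # Candidate state, built from the (reset-scaled) previous state and the input.
    c_tilde = np.tanh(Wc @ np.concatenate([gamma_r * c_prev, x]) + bc)

    # Final state: interpolate between keeping the old state and taking the candidate.
    c = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev
    return c

# Tiny usage example with random weights (n_c = 4 hidden units, n_x = 3 inputs).
rng = np.random.default_rng(0)
n_c, n_x = 4, 3
params = tuple(rng.standard_normal(shape) for shape in
               [(n_c, n_c + n_x), (n_c,), (n_c, n_c + n_x), (n_c,), (n_c, n_c + n_x), (n_c,)])
c_next = gru_step(np.zeros(n_c), rng.standard_normal(n_x), params)
```

Notice that the reset gate only shapes what goes into the candidate, while the update gate decides whether that candidate gets written into memory at all: with gamma_u near zero, the cell can carry c_prev forward unchanged across many timesteps.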

My guess is that Prof Ng will have said something about this in the lectures as well, so it might be worth listening again with the above thoughts in mind.

Then of course we can take this architectural idea even further with LSTM, which Prof Ng will explain next.

Oh, sorry, I second Tom’s point about filing this in the appropriate category, if you’re asking about a specific course. I thought I recognized the slide you show and just jumped to the conclusion that you are talking about DLS Course 5 Week 1, but this material is probably also covered in NLP Course 3.

My reference to lectures was intended to mean the contents of DLS C5 W1.