How do RNNs learn forget and update gates from a ground truth y?

A general question about RNNs:

How do RNNs learn their forget and update gates, e.g. with the sentences about cats from the video?

In other words, how does an RNN learn that “cats” relates to the word “were” later in the sentence and that “cat” relates to “was”?

Since it is supervised learning, does someone indicate that these words belong together, in effect directly teaching the RNN its forget and update gates?
Or does the RNN learn these things by itself? The question is then how, since it doesn’t have a “ground truth” value y (as with image recognition, for example) indicating which words belong together.

I understand how mathematically forward and back propagation works, but this keeps puzzling me.

Hey there @Peeteerrr

In gated RNNs (not the simple RNN, but LSTM, GRU, etc.), the forget and update gates are learned through the training process without any explicit labels indicating which words should relate to each other. RNNs learn these relationships implicitly from the sequential data they are trained on.

In your example, the RNN learns to associate a context like “cats” with the verb “were” rather than “was” by capturing the dependencies between words through the gradients computed during backpropagation, which adjust the model parameters to minimize prediction errors.
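To make it concrete that the gates are just ordinary learned parameters, here is a minimal numpy sketch of a single LSTM step. The dimensions, initialization, and parameter layout are hypothetical choices for illustration; the point is that `W_f` and `W_i` (the forget and update gate weights) are trained by gradient descent like any other weights, with no labels about word relationships:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step. The forget and update (input) gates are
    just learned weight matrices applied to [h_prev, x]."""
    W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o = params
    z = np.concatenate([h_prev, x])        # previous hidden state + current input
    f = sigmoid(W_f @ z + b_f)             # forget gate
    i = sigmoid(W_i @ z + b_i)             # update (input) gate
    c_tilde = np.tanh(W_c @ z + b_c)       # candidate cell state
    c = f * c_prev + i * c_tilde           # new cell state
    o = sigmoid(W_o @ z + b_o)             # output gate
    h = o * np.tanh(c)                     # new hidden state
    return h, c

# Hypothetical sizes: 4-dim word embeddings, 8-dim hidden state
n_x, n_h = 4, 8
rng = np.random.default_rng(0)
params = ([rng.standard_normal((n_h, n_h + n_x)) * 0.1 for _ in range(4)]
          + [np.zeros(n_h) for _ in range(4)])

h, c = np.zeros(n_h), np.zeros(n_h)
for x in rng.standard_normal((5, n_x)):    # a toy 5-word "sentence"
    h, c = lstm_step(x, h, c, params)
```

During training, backpropagation through these equations computes gradients with respect to all eight parameter arrays, so the gates end up remembering whatever (e.g. the plurality of the subject) helps reduce the next-word prediction loss.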

Hope it helps!

Hi Alirez_Saei,

Thanks for your quick and clear response.

So the billions of sentences in the training set are in fact the ground truth (since they are known to be natural language) for figuring out that, in a new sentence, the third word “cats” corresponds to the 10th word “were”?

OK, then I understand it. Thanks for your clarification.


Exactly. There is plenty of “ground truth” in the form of the training set. The RNN, whatever its architecture, learns everything from backpropagation on the training data. If it has forget and update gates as in the LSTM case, those make the model more powerful, or perhaps just make it easier to learn the relationships between the words expressed in the training corpus.
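One way to see where the “ground truth” y comes from: in language modeling, the targets are manufactured from the raw text itself, with each next word serving as the label for the words before it. A toy sketch (the corpus here is an invented two-sentence example, not from the course):

```python
# Hypothetical toy corpus; in practice this is billions of sentences.
corpus = "the cats that ate were full . the cat that ate was full .".split()

# Map each word to an integer id.
vocab = sorted(set(corpus))
word_to_id = {w: i for i, w in enumerate(vocab)}
ids = [word_to_id[w] for w in corpus]

# Each training example: the words so far are the input,
# and the very next word is the target y -- no human labeling needed.
pairs = [(ids[:t], ids[t]) for t in range(1, len(ids))]
```

Because the corpus contains “cats … were” and “cat … was”, any model that wants to lower its prediction loss on these targets is pushed to carry number agreement across the intervening words, which is exactly what the gates learn to do.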

I don’t remember if Prof Ng explicitly addresses this point or not, but I wonder if a plain vanilla RNN can learn the same things that an LSTM can by effectively constructing its own equivalent of forget and update gates using the hidden state. But my intuition is that even if that is in theory possible, the point is that making that part of the architecture explicit just makes it much easier for the training to actually achieve that level of structure. Training cost and time matter greatly when you get to LLM scale, so it’s not good enough for something to be theoretically possible: it has to be achievable with tolerable cost.