How do RNNs learn forget and update gates from a ground truth y?

Peeteerrr · July 9, 2024, 12:18pm

A general question about RNNs:

How do RNNs learn their forget and update gates, for instance with the sentences about cats from the video?

In other words, how does an RNN learn that “cats” relates to the word “were” later in the sentence and that “cat” relates to “was”?

Since it is supervised learning, does someone label which words belong together, and in that way explicitly teach the RNN its forget and update gates?
Or does the RNN learn these things by itself? The question is then how, since it doesn’t have a “ground truth” value y (as with image recognition for example) of which words belong together.

I understand how mathematically forward and back propagation works, but this keeps puzzling me.

Alireza_Saei · July 9, 2024, 12:49pm

Hey there @Peeteerrr

In gated RNNs (LSTM, GRU, etc., not the simple RNN), the forget and update gates are learned during training without any labels saying which words relate to each other. RNNs learn these relationships implicitly from the sequential data they are trained on.

In your example, the RNN learns to associate a context like “cats” with the verb “were” rather than “was” because the gradients computed during backpropagation adjust the model parameters, including the gate weights, to minimize prediction errors. Capturing the dependency between the two words is simply what lowers the error.
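To make that concrete, here is a minimal sketch, not taken from the course. It is a toy one-unit memory cell with a GRU-style update gate, and all names (`final_memory`, `gate_on_signal`, and so on) are made up for illustration. The only supervision is the target `y` at the end of the sequence; nobody labels the gate. Finite differences stand in for backpropagation to keep the code short.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def final_memory(w, b, xs):
    """Run a one-unit gated cell over xs and return the last memory value."""
    c = 0.0
    for x in xs:
        u = sigmoid(w * x + b)        # update gate, computed from the input
        c = (1.0 - u) * c + u * x     # keep old memory or overwrite it
    return c

def loss(w, b, xs, y):
    return (final_memory(w, b, xs) - y) ** 2

def grad(w, b, xs, y, eps=1e-5):
    """Central finite differences (a stand-in for backpropagation)."""
    dw = (loss(w + eps, b, xs, y) - loss(w - eps, b, xs, y)) / (2 * eps)
    db = (loss(w, b + eps, xs, y) - loss(w, b - eps, xs, y)) / (2 * eps)
    return dw, db

# Toy sequence: the first step carries the "plural subject" signal (1.0),
# the remaining steps are filler (0.0). The target depends only on the signal.
xs = [1.0, 0.0, 0.0, 0.0]
y = 1.0

w, b = 0.0, 0.0                       # gate parameters, untrained
loss_before = loss(w, b, xs, y)

for _ in range(2000):                 # plain gradient descent on the loss
    dw, db = grad(w, b, xs, y)
    w -= 0.5 * dw
    b -= 0.5 * db

loss_after = loss(w, b, xs, y)
gate_on_signal = sigmoid(w * 1.0 + b)  # how strongly the cell writes the signal
gate_on_filler = sigmoid(b)            # how strongly filler overwrites memory

print(loss_before, loss_after)
print(gate_on_signal, gate_on_filler)
```

After training, the gate opens more for the signal step than for the filler steps, even though the loss only ever looked at the final output. That is the same mechanism, at toy scale, by which an LSTM or GRU ends up carrying “cats” forward until “were”.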

Hope it helps!

Peeteerrr · July 9, 2024, 1:04pm

Hi Alireza_Saei,

Thanks for your quick and clear response.

So the billions of sentences in the training set are in fact the ground truth (since this is proven to be natural language) for figuring out that for a new sentence, the third word “cats” corresponds to the 10th word “were”?

OK, then I understand it. Thanks for your clarification.
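A small sketch of where the “ground truth” comes from, assuming the model is trained as a language model on next-word prediction (the thread does not state the objective explicitly). The example sentence and the helper name `next_word_pairs` are hypothetical. The label at each position is simply the next word of the same sentence, so no extra annotation is needed.

```python
def next_word_pairs(tokens):
    """Pair each token with the token that follows it (input, target)."""
    inputs = tokens[:-1]
    targets = tokens[1:]
    return list(zip(inputs, targets))

# Hypothetical training sentence, already split into words.
sentence = "the cats that ate the food were full".split()
pairs = next_word_pairs(sentence)

for x, y in pairs:
    print(f"{x:>5} -> {y}")
```

When the input is “food”, the target is “were”. To predict it correctly, the network has to have kept the plural subject “cats” in its state, and the prediction error at that position is what trains the gates to do so.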

paulinpaloalto · July 9, 2024, 3:43pm

Exactly. There is plenty of “ground truth” in the form of the training set. Whatever its architecture, the RNN learns everything from back propagation on the training data. If it has forget and update gates as in the LSTM case, then those make the model more powerful or perhaps just make it easier to learn the relationships between the words expressed in the training corpus.
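For reference, here are the standard LSTM equations, written in what I assume is the course's notation (the thread itself does not spell them out). The gates are ordinary sigmoid layers, so their parameters $W_u, b_u, W_f, b_f, W_o, b_o$ receive gradients from the loss like any other weights.

```latex
\begin{aligned}
\tilde{c}^{\langle t \rangle} &= \tanh\left(W_c\,[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_c\right) \\
\Gamma_u &= \sigma\left(W_u\,[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_u\right) && \text{(update gate)} \\
\Gamma_f &= \sigma\left(W_f\,[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_f\right) && \text{(forget gate)} \\
\Gamma_o &= \sigma\left(W_o\,[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_o\right) && \text{(output gate)} \\
c^{\langle t \rangle} &= \Gamma_u \odot \tilde{c}^{\langle t \rangle} + \Gamma_f \odot c^{\langle t-1 \rangle} \\
a^{\langle t \rangle} &= \Gamma_o \odot \tanh\left(c^{\langle t \rangle}\right)
\end{aligned}
```

Nothing in these equations is labeled by hand. The only target is the $y$ used in the loss, and the chain rule carries its error back through $c^{\langle t \rangle}$ into $\Gamma_u$ and $\Gamma_f$.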

I don’t remember if Prof. Ng explicitly addresses this point or not, but I wonder if a plain vanilla RNN can learn the same things that an LSTM can by effectively constructing its own equivalent of forget and update gates using the hidden state. My intuition is that even if that is possible in theory, making that part of the architecture explicit just makes it much easier for the training to actually achieve that level of structure. Training cost and time matter greatly when you get to LLM scale, so it’s not good enough for something to be theoretically possible: it has to be achievable with tolerable cost.
