LSTM - a fundamental question about the weights of the forget and update gates

I have a fundamental question about the weights of the forget gate and the update gate (i.e., Wf and Wu).

In the video, we use the example “The cat, which ate already, was full.” My understanding is that, through training on this sample, Wf and Wu will eventually learn that the word at time-step 2 (“cat”) has more influence on the activation at time-step 6 (“was”).

However, each sample in a training corpus is a different case. After we have trained an LSTM-based language model on the corpus, how is it possible that the trained Wf and Wu can correctly handle all the different cases?

I know my question probably exposes my lack of understanding about RNN or even DL at some fundamental level, but I hope someone can shed some light so that I can get some breakthrough. :slight_smile:

It seems like you are looking for details of the backward pass. If this is the case, please do the optional exercise in the programming assignment where you’ll implement the backward pass of an LSTM.

Thanks, but my question is more conceptual.

Let me rephrase my question: :slight_smile:

  1. The learned Wf and Wu control the flow of the cell state (based on the hidden state, the input x, and the cell state).
  2. But how the cell state should flow (in order to produce the desired output) can differ from sample to sample. E.g., sample 1: “The cat, which ate already, was full.” Sample 2: “The cat was full because it ate already.”
  3. After training, Wf and Wu can handle all the different cases correctly, despite point no. 2 above. I can’t fathom how this can be, and this is the conceptual block I have.

I’ve asked other mentors to reply to this topic since I can’t seem to offer a better explanation than telling you that the layer learns via training.

Does this help? Understanding LSTM Networks -- colah's blog


Thanks a lot for trying to help. I read that article already (and a few other good ones out there). All those articles explain the mechanism of the LSTM, but none of them offers any insight into how to interpret the learned Wf and Wu.

This article is also a good one, giving good intuition about the meaning of the hidden state and the cell state in an LSTM. However, just like the other articles, it explains how Wf and Wu are updated based on one sentence. But how Wf and Wu, after being trained on a whole corpus of samples, can still be effective remains a mystery to me.

I think at this stage I just have to accept that “it works”.

Hey @Patrick_Ng,
Let me try to give my 2 cents. I will share some insights about one of these gates, and the same insights extend to the other one. Let’s pick the forget gate and consider its equation:

\Gamma_f = \sigma(W_f[a^{<t-1>}, x^{<t>}] + b_f)

I agree with your statement that, after learning, W_f is a fixed set of parameters shared among all the time-steps. But if we think about it closely, W_f is a matrix, and \Gamma_f, which is used in the equations along with the other gates to produce the output, does not depend on W_f alone: it depends on the product of W_f with the concatenation of the previous hidden state a^{<t-1>} and the current input x^{<t>}. So, even though W_f (the matrix) is a fixed set of parameters, that matrix product varies from example to example, thereby adjusting the output \Gamma_f to be suitable for every example, be it one of the examples from the training set or one from the test set (as long as it is similar to the training examples).
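To make this concrete, here is a toy NumPy sketch (the sizes, the random weights, and the stand-in word embeddings are all made up for illustration): the same fixed W_f produces different gate activations \Gamma_f for different inputs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

n_a, n_x = 4, 3  # toy hidden-state and input sizes
W_f = rng.standard_normal((n_a, n_a + n_x))  # fixed after "training"
b_f = np.zeros(n_a)

def forget_gate(a_prev, x_t):
    # Gamma_f = sigmoid(W_f [a_prev; x_t] + b_f)
    concat = np.concatenate([a_prev, x_t])
    return sigmoid(W_f @ concat + b_f)

a_prev = np.zeros(n_a)
x_cat = rng.standard_normal(n_x)  # stand-in embedding for "cat"
x_was = rng.standard_normal(n_x)  # stand-in embedding for "was"

# Same W_f, two different inputs -> two different gate vectors
g1 = forget_gate(a_prev, x_cat)
g2 = forget_gate(a_prev, x_was)
print(g1)
print(g2)
```

Every entry of each gate vector lies in (0, 1) because of the sigmoid, but the two vectors differ, even though W_f never changed.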

In fact, if we try to relate it, this is similar to how any typical neural network functions. The weights learnt by the network are constant once training is complete; it is when different inputs are forward-propagated through these “fixed” weights that the network produces different outputs. To form the analogy: W_f is the fixed set of weights (after learning is complete), a^{<t-1>} and x^{<t>} are the varying inputs, and \Gamma_f is the varying output (adjusted accordingly).

Additionally, one other aspect I am always grateful for is the fact that \Gamma_f is a vector, not a scalar. This means we can extend the size of the gates to hold as much information as we like, assuming an ideal scenario of infinite computation.

Let us know if this helps, and we can discuss further.



Elemento’s explanation should help at the level of understanding how it can generate different results from different inputs. But at the higher conceptual level, consider how our brains can process language. Maybe the LSTM’s learned “knowledge” actually includes things like the ability to recognize certain words as nouns that are a) positioned to be the subject of the sentence and b) singular or plural. And it can do this even when the sentence starts with a subordinate clause.

At some level, I agree that it all seems like “magic”, but why is this any more surprising than the fact that the simple fully connected networks we learned about in Course 1 can recognize a cat in an image just based on the “unrolled” vector of pixels from a 2D image? You’d think that the flattening would destroy the geometric relationships in the image, but somehow it doesn’t. And the pixel color values are just numbers. If you added 3 to the green value of every pixel, it would change the colors, but the network would still be able to recognize the more greenish cats.


@balaji.ambresh @Elemento @paulinpaloalto
My deep thanks to all of you for your help and insights! They helped me look at this from different angles, and the comparison with the NN detecting the “green cat” helps a lot. At some level it all still sounds like a magical black box to me, but I hope that with more study and practice these things will sink in and feel less magical. (And if one day I can get to understand this paper about how to visualize RNNs, that should help a lot too.) :slightly_smiling_face:

Again, thanks a lot!


Hi, Patrick.

Thanks very much for the link to the Karpathy paper! I had not heard of it before, and it sounds totally relevant to your question and to understanding how LSTMs actually accomplish what they do. I’ve only read the abstract so far, but I look forward to reading further and seeing if I can understand what they are talking about. Andrej Karpathy has written some great blog posts about RNNs and LSTMs that I’ve seen in the past.

Onward! :nerd_face:
