Attention Models Lecture

  • At 4:56 in the “Attention Model” video, the professor draws 2 inputs to an RNN block. My understanding is that y=x and that s=a, i.e. the short-term memory input to the RNN block. But which RNN input does c correspond to?
  • How do we express a penalty for the distance (in number of input tokens) when computing the attention given to a<t’>? I would imagine that alpha should assign less weight to a word that is further away than to a word that is closer. The formula for computing alpha involves a neural network that takes the previous state and a<t’> as inputs, but it doesn’t include any input for the distance of the token that generated that activation, so it looks like attention is based only on token content rather than token distance. (I’ve transcribed the formula as I read it after this list.)
    maybe a<t’> would implicitly include a counter from the beginning of the sentence?
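For reference, here is the alpha formula as I read it from the lecture slide (my own transcription, so the notation may be slightly off):

```latex
% e^{<t,t'>} comes from a small neural network whose only inputs are the
% previous decoder state s^{<t-1>} and the encoder activation a^{<t'>}
e^{<t,t'>} = \mathrm{NN}\!\left(s^{<t-1>},\, a^{<t'>}\right)

% The attention weights are a softmax over t', so they sum to 1,
% but nothing here depends explicitly on the distance |t - t'|
\alpha^{<t,t'>} = \frac{\exp\!\left(e^{<t,t'>}\right)}{\sum_{t''=1}^{T_x} \exp\!\left(e^{<t,t''>}\right)}
```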

Andrew discusses this at 3:16 in that video.

Can you give some reference for why you believe distance is used here?

No, that is not the case.

No, he doesn’t. I watched the video multiple times and didn’t understand it, which is why I am asking. It would be clearer if you could explain it directly, please. To reiterate, my understanding of LSTMs is that there are 3 inputs: the previous cell state (c), the short-term memory activations (a), and the input (x), and I don’t understand which data (short-term memory, context, previous unit output) goes to which input.

Intuitively, it seems that if an input word t’ is further away from the word t we are trying to generate, it should receive less attention. I would have expected to see some sort of penalty in the alpha calculation when the word is far away (i.e. when |t-t’| is large).

Here is a good summary from the lecture:

[screenshot of the lecture slide]

He then goes on from that time stamp to describe where the alpha weights come from.

It’s not a distance; it comes from a “small neural network”.
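
To make that concrete, here is a minimal NumPy sketch of one attention step. The layer sizes, weights, and the one-hidden-layer architecture are all made up for illustration (the course’s programming exercise builds this with Keras layers, but the idea is the same). Notice that the only inputs are s<t-1> and the a<t’> activations, so any distance effect has to be learned from the content of those vectors rather than read off a position index:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

# Toy dimensions -- all made up for illustration.
Tx, n_a, n_s, n_hidden = 6, 8, 16, 10

rng = np.random.default_rng(0)
a = rng.standard_normal((Tx, n_a))      # pre-attention activations a<1>..a<Tx>
s_prev = rng.standard_normal(n_s)       # previous post-attention state s<t-1>

# The "small neural network": one hidden tanh layer and a scalar energy output.
W1 = rng.standard_normal((n_s + n_a, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.standard_normal((n_hidden, 1))
b2 = np.zeros(1)

# The same s<t-1> is concatenated with every a<t'>; there is no |t - t'| input.
concat = np.concatenate([np.tile(s_prev, (Tx, 1)), a], axis=1)  # (Tx, n_s + n_a)
e = (np.tanh(concat @ W1 + b1) @ W2 + b2).ravel()               # energies e<t,t'>
alphas = softmax(e)                                             # attention weights, sum to 1
context = alphas @ a                                            # context vector c<t>

print(alphas, alphas.sum(), context.shape)
```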

So I think my question was getting ahead of the lecture; I didn’t know I was going to have to figure it out in the programming exercise. For anyone else who has the same question: the context vector becomes the input to the post-attention LSTM.
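
To spell out the wiring for anyone else: the context vector computed from the alphas plays the role of the x input, while s<t-1> and the cell state are carried through the recurrent connections. Here is a minimal Keras-style sketch of one post-attention step; all sizes, layer names, and the output vocabulary size are made up, so treat it as an illustration of the wiring rather than the exercise’s actual code:

```python
import tensorflow as tf
from tensorflow.keras.layers import Input, LSTM, Dense

# Toy sizes -- all hypothetical, not the exercise's exact values.
n_a = 8      # size of the context vector produced by the attention weights
n_s = 16     # size of the post-attention LSTM state
vocab = 11   # arbitrary output vocabulary size

context_t = Input(shape=(1, n_a), name="context")  # c<t>, used as the LSTM's x input
s0 = Input(shape=(n_s,), name="s_prev")            # previous hidden state s<t-1>
c0 = Input(shape=(n_s,), name="c_prev")            # previous LSTM cell state

post_lstm = LSTM(n_s, return_state=True, name="post_attention_LSTM")

# The context vector is the "x" input; s<t-1> and the cell state arrive
# through initial_state, i.e. the recurrent connections.
_, s, c = post_lstm(context_t, initial_state=[s0, c0])
y_t = Dense(vocab, activation="softmax", name="output")(s)

one_step = tf.keras.Model(inputs=[context_t, s0, c0], outputs=[y_t, s, c])
one_step.summary()
```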

Also, why do we need shared weights for all the LSTM units in a given layer? Are there ever RNNs where each unit has its own unique weights?