Attention Models Lecture

  • At 4:56 in the “Attention Model” video, the professor draws 2 inputs to an RNN block. My understanding is that y=x and that s=a, i.e. the short-term memory input to the RNN block. But which RNN input does c correspond to?
  • How do we express a penalty for the distance (in number of input tokens) when computing the attention given to a<t’>? I would imagine that alpha should assign less weight to a word that is further away than to a word that is closer. The formula for computing alpha involves a neural network that takes the previous state and a<t’> as inputs, but it doesn’t include any input for the distance of the token that generated that activation, so it looks like attention is based only on token content rather than token distance. (I’ve transcribed the formula as I read it after this list.)
    maybe a<t’> would implicitly include a counter from the beginning of the sentence?
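For reference, here is the alpha formula as I read it from the lecture slide (my own transcription, so the notation may be slightly off):

```latex
% e^{<t,t'>} comes from a small neural network whose only inputs are the
% previous decoder state s^{<t-1>} and the encoder activation a^{<t'>}
e^{<t,t'>} = \mathrm{NN}\!\left(s^{<t-1>},\, a^{<t'>}\right)

% The attention weights are a softmax over t', so they sum to 1,
% but nothing here depends explicitly on the distance |t - t'|
\alpha^{<t,t'>} = \frac{\exp\!\left(e^{<t,t'>}\right)}{\sum_{t''=1}^{T_x} \exp\!\left(e^{<t,t''>}\right)}
```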

Andrew discusses this at 3:16 in that video.

Can you give some reference for why you believe distance is used here?

No, that is not the case.

No, he doesn’t. I watched the video multiple times and didn’t understand it, which is why I am asking. It would be clearer if you could explain it directly, please. To reiterate, my understanding of LSTMs is that there are 3 inputs: the previous cell state (c), the short-term memory activations (a), and the input (x), and I don’t understand which data (short-term memory, context, previous unit output) goes to which input.

Intuitively, it seems that if an input word t’ is further away from the word t we are trying to generate, it should receive less attention. I would have expected to see some sort of penalty in the alpha calculation when the word is far away (i.e. when |t-t’| is large).

Here is a good summary from the lecture:

[screenshot of the lecture slide]

He then goes on from that time stamp to describe where the alpha weights come from.

It’s not a distance; it comes from a “small neural network”.
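
To make that concrete, here is a minimal NumPy sketch of one attention step. The layer sizes, weights, and the one-hidden-layer architecture are all made up for illustration (the course’s programming exercise builds this with Keras layers, but the idea is the same). Notice that the only inputs are s<t-1> and the a<t’> activations, so any distance effect has to be learned from the content of those vectors rather than read off a position index:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

# Toy dimensions -- all made up for illustration.
Tx, n_a, n_s, n_hidden = 6, 8, 16, 10

rng = np.random.default_rng(0)
a = rng.standard_normal((Tx, n_a))      # pre-attention activations a<1>..a<Tx>
s_prev = rng.standard_normal(n_s)       # previous post-attention state s<t-1>

# The "small neural network": one hidden tanh layer and a scalar energy output.
W1 = rng.standard_normal((n_s + n_a, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.standard_normal((n_hidden, 1))
b2 = np.zeros(1)

# The same s<t-1> is concatenated with every a<t'>; there is no |t - t'| input.
concat = np.concatenate([np.tile(s_prev, (Tx, 1)), a], axis=1)  # (Tx, n_s + n_a)
e = (np.tanh(concat @ W1 + b1) @ W2 + b2).ravel()               # energies e<t,t'>
alphas = softmax(e)                                             # attention weights, sum to 1
context = alphas @ a                                            # context vector c<t>

print(alphas, alphas.sum(), context.shape)
```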

So I think my question was getting ahead of the lecture; I didn’t know I was going to have to figure it out in the programming exercise. For anyone else who has the same question: the context vector becomes the input to the post-attention LSTM.
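
To spell out the wiring for anyone else: the context vector computed from the alphas plays the role of the x input, while s<t-1> and the cell state are carried through the recurrent connections. Here is a minimal Keras-style sketch of one post-attention step; all sizes, layer names, and the output vocabulary size are made up, so treat it as an illustration of the wiring rather than the exercise’s actual code:

```python
import tensorflow as tf
from tensorflow.keras.layers import Input, LSTM, Dense

# Toy sizes -- all hypothetical, not the exercise's exact values.
n_a = 8      # size of the context vector produced by the attention weights
n_s = 16     # size of the post-attention LSTM state
vocab = 11   # arbitrary output vocabulary size

context_t = Input(shape=(1, n_a), name="context")  # c<t>, used as the LSTM's x input
s0 = Input(shape=(n_s,), name="s_prev")            # previous hidden state s<t-1>
c0 = Input(shape=(n_s,), name="c_prev")            # previous LSTM cell state

post_lstm = LSTM(n_s, return_state=True, name="post_attention_LSTM")

# The context vector is the "x" input; s<t-1> and the cell state arrive
# through initial_state, i.e. the recurrent connections.
_, s, c = post_lstm(context_t, initial_state=[s0, c0])
y_t = Dense(vocab, activation="softmax", name="output")(s)

one_step = tf.keras.Model(inputs=[context_t, s0, c0], outputs=[y_t, s, c])
one_step.summary()
```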

Also, why do we need shared weights for all the LSTM units in a given layer? Are there ever RNNs where each unit has its own unique weights?