Gradient Descent for Attention Model

If the weights for a given context are different at each timestep, how do we initialize all the context weights and perform gradient descent when we don’t know how many weights there will be? Inputs, and therefore output time steps, may vary (that’s the whole point of an RNN), which means the number of context weights will also vary across the training set.

If you are talking about variable input lengths, then, as in the original paper “Attention Is All You Need”, we usually add “pad” tokens to bring every sequence to a fixed length. This keeps everything simple.

On the other hand, in the Transformer, which uses only attention and no RNN, there is a mask that prevents the model from attending to the padding positions.
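To make the padding + masking idea concrete, here is a minimal NumPy sketch (a toy example of my own, not course code; the token ids and sizes are made up):

```python
import numpy as np

# Hypothetical toy example: pad token ids to a fixed Tx, then mask the
# padded positions when computing attention weights with a softmax.
PAD_ID = 0
Tx = 6

def pad_to_fixed_length(token_ids, maxlen=Tx, pad_id=PAD_ID):
    """Right-pad (or truncate) a list of token ids to length maxlen."""
    return np.array(token_ids[:maxlen] + [pad_id] * max(0, maxlen - len(token_ids)))

def masked_softmax(scores, mask):
    """Softmax over scores, forcing masked (padded) positions to ~0 weight."""
    scores = np.where(mask, scores, -1e9)    # large negative -> exp() ~ 0
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

sentence = [12, 7, 45]                       # a length-3 input sentence
x = pad_to_fixed_length(sentence)            # shape (Tx,)
mask = (x != PAD_ID)                         # True only for real tokens

raw_scores = np.random.randn(Tx)             # e.g. the per-position "e" values
alphas = masked_softmax(raw_scores, mask)
print(alphas)                                # padded positions get ~0 attention
```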

Just to clarify, I’m talking about the RNN encoder/decoder that uses attention, not the Transformer. I’m still having a hard time understanding how we know how many weights there will be ahead of time, since the decoder output could be variable length as well.

I’m slightly confused. Can you be more specific which part of the course you’re referring to, by week number and assignment or lecture?

Course 5, week 3, video title: Attention Mechanism

What I’m confused about is that we need new context weights at every RNN timestep of the output, which means that if the input sequence length varies, or the output translation sequence length varies, we will need more context vectors and therefore more weights. How is it possible to train a neural network when the number of weights changes across the dataset?

You are correct that you can’t change the number of weights during training. So that’s clearly not what has to happen.

There are eight videos in that section of the course. None have exactly that title. Can you be more specific?


Sorry, the video title is ‘Attention Model Intuition’, followed by ‘Attention Model’, although the first video contains essentially all of the information I am asking about.

Correct me if I’m wrong, but after further research I now understand that the context weights (alpha) are not learnable; they are predicted by a neural network. So the context weights for timestep t are the softmax output of a neural network that takes as inputs the encoder outputs for all timesteps and the hidden state of the decoder for timestep t - 1 (as shown in the diagram).
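For concreteness, here is a rough NumPy sketch of that computation for a single decoder timestep (the layer sizes and function names are made up; it only loosely follows the video’s alpha<t, t'> notation):

```python
import numpy as np

rng = np.random.default_rng(0)
Tx, n_a, n_s = 5, 8, 10    # input length, encoder / decoder hidden sizes (made up)

# Learned weights of the small alignment network -- note their shapes do not
# depend on Tx or Ty, only on the hidden sizes.
W = rng.standard_normal((n_s + n_a, 16))
w_out = rng.standard_normal(16)

def energy(s_prev, a_t):
    """One-hidden-layer network producing the scalar e<t, t'> for one encoder step."""
    return np.tanh(np.concatenate([s_prev, a_t]) @ W) @ w_out

def attention_step(s_prev, encoder_outputs):
    """Compute alpha<t, .> and the context vector c<t> for one decoder timestep t."""
    e = np.array([energy(s_prev, a) for a in encoder_outputs])   # shape (Tx,)
    alphas = np.exp(e - e.max()); alphas /= alphas.sum()         # softmax over t'
    context = (alphas[:, None] * encoder_outputs).sum(axis=0)    # weighted sum of a<t'>
    return alphas, context

encoder_outputs = rng.standard_normal((Tx, n_a))   # a<1> ... a<Tx> from the encoder
s_prev = rng.standard_normal(n_s)                  # decoder state s<t-1>
alphas, context = attention_step(s_prev, encoder_outputs)
print(alphas.sum())   # 1.0 -- the alphas are computed on the fly, not learned directly
```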

And if this is correct, my follow-up question would be: how is this neural network that computes the softmax able to handle variable-length inputs, since it looks to be a standard feedforward network, not a recurrent one?

I think the key concept is at 2:28 in the video, where Andrew says that Attention only looks at part of the sentence at a time - not the entire variable-length sentence.

More later as I go through the video.

The next key point is in the “Attention Model” video at 8:54, where he says that the number of parameters to be learned is the product of Tx and Ty. So in the language translation example, the number of parameters scales with the lengths of the input and output sentences.
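Concretely, each output step t attends over all Tx input positions, so there are Tx · Ty attention values in total, and each row is normalized by a softmax (written roughly in the course’s notation):

```latex
\alpha^{\langle t, t' \rangle}
  = \frac{\exp\!\big(e^{\langle t, t' \rangle}\big)}
         {\sum_{t''=1}^{T_x} \exp\!\big(e^{\langle t, t'' \rangle}\big)},
\qquad t = 1, \dots, T_y,\quad t' = 1, \dots, T_x,
\qquad \sum_{t'=1}^{T_x} \alpha^{\langle t, t' \rangle} = 1 .
```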

Sentences that are shorter than the designed maximum Tx are handled using the EOS symbol. Sentences longer than Tx may have to be truncated or split.

Ty would have to be selected based on the characteristics of the language you’re translating.

Those attention parameters (alpha) that you’re referring to are not learnable. In that video he mentions that there is another neural network used specifically to output a factor ‘e’ for each of the input words, which is then fed into a softmax to generate the attention parameters. My question now is: does this network take in all of the input words at once and generate ‘e’ for every word, or do we apply the network to each input word one at a time to generate its ‘e’ and then compute the softmax over all of them? My guess is that the network is applied one word at a time, or else it would have to deal with variable-length inputs.

That’s the NN I’m referring to.
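To illustrate that point with a hypothetical NumPy sketch (not the assignment’s actual code): the small network’s weights are shared across input positions, so applying it one word at a time and applying it to all Tx positions at once produce the same ‘e’ values, and the number of learned weights never depends on the input length:

```python
import numpy as np

rng = np.random.default_rng(1)
n_a, n_s, hidden = 8, 10, 16     # hidden sizes (made up)

# The alignment network's weights are shared across input positions,
# so their count is fixed regardless of Tx.
W = rng.standard_normal((n_s + n_a, hidden))
w_out = rng.standard_normal(hidden)

def e_one_word(s_prev, a_t):
    """Apply the small network to a single encoder output a<t'>."""
    return np.tanh(np.concatenate([s_prev, a_t]) @ W) @ w_out

def e_all_words(s_prev, encoder_outputs):
    """Same weights applied to every position at once -- just a batched version."""
    s_tiled = np.tile(s_prev, (encoder_outputs.shape[0], 1))             # (Tx, n_s)
    return np.tanh(np.concatenate([s_tiled, encoder_outputs], axis=1) @ W) @ w_out

for Tx in (3, 7):                                  # works for any input length
    a = rng.standard_normal((Tx, n_a))             # encoder outputs a<1> ... a<Tx>
    s_prev = rng.standard_normal(n_s)              # previous decoder state
    per_word = np.array([e_one_word(s_prev, a_t) for a_t in a])
    assert np.allclose(per_word, e_all_words(s_prev, a))
    print(Tx, "ok")
```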