Gradient Descent for Attention Model

If the weights for a given context are different at each timestep, how do we initialize all the context weights and perform gradient descent when we don’t know how many weights there will be? Inputs, and therefore output time steps, may vary (that’s the whole point of an RNN), which means the number of context weights will also vary across the training set.

If you are talking about variable input lengths, then, as in the original paper “Attention Is All You Need”, we usually add “pad” tokens to bring every sequence to a fixed length. This keeps everything simple.

On the other hand, in the Transformer, which uses only attention and no RNN, there is a mask that prevents the model from attending to the padding positions.
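To make the padding + masking idea concrete, here is a minimal NumPy sketch (a toy example of my own, not course code; the token ids and sizes are made up):

```python
import numpy as np

# Hypothetical toy example: pad token ids to a fixed Tx, then mask the
# padded positions when computing attention weights with a softmax.
PAD_ID = 0
Tx = 6

def pad_to_fixed_length(token_ids, maxlen=Tx, pad_id=PAD_ID):
    """Right-pad (or truncate) a list of token ids to length maxlen."""
    return np.array(token_ids[:maxlen] + [pad_id] * max(0, maxlen - len(token_ids)))

def masked_softmax(scores, mask):
    """Softmax over scores, forcing masked (padded) positions to ~0 weight."""
    scores = np.where(mask, scores, -1e9)    # large negative -> exp() ~ 0
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

sentence = [12, 7, 45]                       # a length-3 input sentence
x = pad_to_fixed_length(sentence)            # shape (Tx,)
mask = (x != PAD_ID)                         # True only for real tokens

raw_scores = np.random.randn(Tx)             # e.g. the per-position "e" values
alphas = masked_softmax(raw_scores, mask)
print(alphas)                                # padded positions get ~0 attention
```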

Just to clarify, I’m talking about the RNN encoder/decoder that uses attention, not the Transformer. I’m still having a hard time understanding how we know how many weights there will be ahead of time, since the decoder output could be variable length as well.

I’m slightly confused. Can you be more specific which part of the course you’re referring to, by week number and assignment or lecture?

Course 5, week 3, video title: Attention Mechanism

What I’m confused about is that we need new context weights at every RNN timestep of the output, which means that if the input sequence length varies, or the output translation sequence length varies, we will need more context vectors and therefore more weights. How is it possible to train a neural network when the number of weights changes across the dataset?

You are correct that you can’t change the number of weights during training. So that’s clearly not what has to happen.

There are eight videos in that section of the course. None have exactly that title. Can you be more specific?


Sorry, the video title is ‘Attention Model Intuition’, followed by ‘Attention Model’, although the first video contains essentially all of the information I am asking about.

Correct me if I’m wrong, but after further research I now understand that the context weights (alpha) are not learnable; they are predicted by a neural network. So the context weights for timestep t are the softmax output of a neural network that takes as inputs the encoder outputs for all timesteps and the hidden state of the decoder for timestep t - 1 (as shown in the diagram).
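For concreteness, here is a rough NumPy sketch of that computation for a single decoder timestep (the layer sizes and function names are made up; it only loosely follows the video’s alpha<t, t'> notation):

```python
import numpy as np

rng = np.random.default_rng(0)
Tx, n_a, n_s = 5, 8, 10    # input length, encoder / decoder hidden sizes (made up)

# Learned weights of the small alignment network -- note their shapes do not
# depend on Tx or Ty, only on the hidden sizes.
W = rng.standard_normal((n_s + n_a, 16))
w_out = rng.standard_normal(16)

def energy(s_prev, a_t):
    """One-hidden-layer network producing the scalar e<t, t'> for one encoder step."""
    return np.tanh(np.concatenate([s_prev, a_t]) @ W) @ w_out

def attention_step(s_prev, encoder_outputs):
    """Compute alpha<t, .> and the context vector c<t> for one decoder timestep t."""
    e = np.array([energy(s_prev, a) for a in encoder_outputs])   # shape (Tx,)
    alphas = np.exp(e - e.max()); alphas /= alphas.sum()         # softmax over t'
    context = (alphas[:, None] * encoder_outputs).sum(axis=0)    # weighted sum of a<t'>
    return alphas, context

encoder_outputs = rng.standard_normal((Tx, n_a))   # a<1> ... a<Tx> from the encoder
s_prev = rng.standard_normal(n_s)                  # decoder state s<t-1>
alphas, context = attention_step(s_prev, encoder_outputs)
print(alphas.sum())   # 1.0 -- the alphas are computed on the fly, not learned directly
```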

And if this is correct, my follow-up question would be: how is this neural network that computes the softmax able to handle variable-length inputs, since it looks to be a standard feedforward network, not a recurrent one?

I think the key concept is at 2:28 in the video, where Andrew says that Attention only looks at part of the sentence at a time - not the entire variable-length sentence.

More later as I go through the video.

The next key point is in the “Attention Model” video at 8:54, where he says that the number of parameters to be learned is the product of Tx and Ty. So in the language translation example, the number of parameters scales with the lengths of the input and output sentences.
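Concretely, each output step t attends over all Tx input positions, so there are Tx · Ty attention values in total, and each row is normalized by a softmax (written roughly in the course’s notation):

```latex
\alpha^{\langle t, t' \rangle}
  = \frac{\exp\!\big(e^{\langle t, t' \rangle}\big)}
         {\sum_{t''=1}^{T_x} \exp\!\big(e^{\langle t, t'' \rangle}\big)},
\qquad t = 1, \dots, T_y,\quad t' = 1, \dots, T_x,
\qquad \sum_{t'=1}^{T_x} \alpha^{\langle t, t' \rangle} = 1 .
```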

Sentences that are shorter than the designed maximum Tx are handled using the EOS symbol. Sentences longer than Tx may have to be truncated or split.

Ty would have to be selected based on the characteristics of the language you’re translating.

Those attention parameters (alpha) that you’re referring to are not learnable. In that video he mentions that there is another neural network used specifically to output a factor ‘e’ for each of the input words, which is then fed into a softmax to generate the attention parameters. My question now is: does this network take in all of the input words at once and generate ‘e’ for every word, or do we apply the network to each input word one at a time to generate its ‘e’ and then compute the softmax over all of them? My guess is that the network is applied one word at a time, or else it would have to deal with variable-length inputs.

That’s the NN I’m referring to.
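To illustrate that point with a hypothetical NumPy sketch (not the assignment’s actual code): the small network’s weights are shared across input positions, so applying it one word at a time and applying it to all Tx positions at once produce the same ‘e’ values, and the number of learned weights never depends on the input length:

```python
import numpy as np

rng = np.random.default_rng(1)
n_a, n_s, hidden = 8, 10, 16     # hidden sizes (made up)

# The alignment network's weights are shared across input positions,
# so their count is fixed regardless of Tx.
W = rng.standard_normal((n_s + n_a, hidden))
w_out = rng.standard_normal(hidden)

def e_one_word(s_prev, a_t):
    """Apply the small network to a single encoder output a<t'>."""
    return np.tanh(np.concatenate([s_prev, a_t]) @ W) @ w_out

def e_all_words(s_prev, encoder_outputs):
    """Same weights applied to every position at once -- just a batched version."""
    s_tiled = np.tile(s_prev, (encoder_outputs.shape[0], 1))             # (Tx, n_s)
    return np.tanh(np.concatenate([s_tiled, encoder_outputs], axis=1) @ W) @ w_out

for Tx in (3, 7):                                  # works for any input length
    a = rng.standard_normal((Tx, n_a))             # encoder outputs a<1> ... a<Tx>
    s_prev = rng.standard_normal(n_s)              # previous decoder state
    per_word = np.array([e_one_word(s_prev, a_t) for a_t in a])
    assert np.allclose(per_word, e_all_words(s_prev, a))
    print(Tx, "ok")
```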