C5 W4 A1: Question about MultiHeadAttention

In the Week 4 Transformers programming assignment, we create an EncoderLayer class. As the first step in call(), it passes the input sentence to the instantiated MultiHeadAttention object, i.e.

# calculate self-attention using mha(~1 line)
self_attn_output = self.mha(x, x, attention_mask=mask)

The two occurrences of x (the input sentence) are, as I understand it, because we are doing self-attention in this step. x is paying attention to x.

What I do not understand is how this corresponds to the q, k, v that are meant to be fed into the Attention layer. When we come to the decoder layer, we use this statement:

attn1, attn_weights_block1 = self.mha1(x, x, x, look_ahead_mask, return_attention_scores=True)

Here, the same input seems to be used for q, k and v. And in the second decoder block, we have:

attn2, attn_weights_block2 = self.mha2(out1, enc_output, enc_output, padding_mask, return_attention_scores=True)

This implies q is out1, k is enc_output and v is also enc_output. How can the same thing be k and v?
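To make the comparison concrete, here is a standalone sketch of the two kinds of call outside the assignment, written with keyword arguments (made-up sizes; I am assuming tf.keras.layers.MultiHeadAttention):

import tensorflow as tf

mha = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=16)

x = tf.random.uniform((2, 5, 64))            # input sentence, (batch, seq_len, d_model)
enc_output = tf.random.uniform((2, 7, 64))   # pretend encoder output, different length

# self-attention: all three arguments are the same tensor
self_attn = mha(query=x, value=x, key=x)                     # shape (2, 5, 64)

# cross-attention: q from the decoder, k and v both from the encoder
cross_attn = mha(query=x, value=enc_output, key=enc_output)  # shape (2, 5, 64)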

What am I missing here? How does the need for q, k and v correspond to what we are actually feeding to these MultiHeadAttention layers?

Thanks for any enlightenment!

Julian


I edited the thread title to include the week and assignment numbers. It helps the mentors find the right notebook to use for reference.

That’s not correct. For self-attention, you need to use ‘x’ three times (for each of k, q, and v), then the mask.
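As a sketch (not the graded line verbatim, and assuming the Keras MultiHeadAttention call where the mask is passed as attention_mask), the encoder self-attention step would look like:

# self-attention: query, value and key are all the input x
self_attn_output = self.mha(x, x, x, attention_mask=mask)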

Well, basically I don’t understand why this function performs self-attention if query, key and value are the same. As far as I understand, x in self.mha(x, x, x, mask) should be a placeholder for the arguments to be fed in at runtime, right? However, q and v cannot be the same kind of array, as they will have different dimensions.
Basically, I do not follow what happens when the Encoder calls the EncoderLayer:

x = self.enc_layers[i](x, training, mask)

Supposedly, the input x should be composed of q, k and v, right?
How does it figure out which parts of x represent q, k and v when we just pass the same argument for all of them?
I looked at the MultiHeadAttention documentation but did not really figure out how it works.
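For example, this minimal check (made-up sizes, assuming tf.keras.layers.MultiHeadAttention) runs without error even though query, key and value are literally the same array, which is exactly what I do not follow:

import tensorflow as tf

mha = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=64)
x = tf.random.uniform((1, 10, 512))     # (batch, seq_len, d_model)
out = mha(query=x, value=x, key=x)      # same tensor passed three times
print(out.shape)                        # (1, 10, 512)
print([w.name for w in mha.weights])    # separate query/key/value kernels inside the layer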
Any hints, or a good source to read?

Aff
