In the Week 4 Transformers programming assignment, we create an EncoderLayer class. As the first step in call(), it passes the input sentence to the instantiated MultiHeadAttention object, i.e.:
# calculate self-attention using mha(~1 line)
self_attn_output = self.mha(x, x, attention_mask=mask)
The two occurrences of x (the input sentence) are there, as I understand it, because we are doing self-attention in this step: x is paying attention to x.
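To make sure I'm reading the call correctly, here is a minimal standalone sketch of that encoder step using tf.keras.layers.MultiHeadAttention directly (the dimensions and the all-ones mask are toy values I made up, not the assignment's):

```python
import numpy as np
import tensorflow as tf

# Toy setup: batch of 2 sentences, 5 tokens each, model dimension 8.
mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=4)
x = tf.random.uniform((2, 5, 8))
mask = np.ones((2, 5, 5), dtype=bool)  # (batch, target_len, source_len)

# Keras' signature is call(query, value, key=None, attention_mask=...).
# With key omitted it defaults to value, so this call is q = k = v = x.
self_attn_output = mha(x, x, attention_mask=mask)
print(self_attn_output.shape)  # (2, 5, 8): one output vector per query token
```

If that's right, the two x's are query and value, and the key silently defaults to the value tensor, but I'd like to confirm that reading.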
What I do not understand is how this corresponds to the q, k and v that are meant to be fed into the attention layer. When we come to the decoder layer, we use this statement:
attn1, attn_weights_block1 = self.mha1(x, x, x, look_ahead_mask, return_attention_scores=True)
Here, the same input seems to be used for q, k and v. And in the second decoder block, we have:
attn2, attn_weights_block2 = self.mha2(out1, enc_output, enc_output, padding_mask, return_attention_scores=True)
This implies q is out1, k is enc_output, and v is also enc_output. How can the same thing be both k and v?
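To poke at this, I tried the cross-attention call on toy tensors (shapes invented by me, masks dropped for simplicity) and checked whether passing enc_output as both value and key is the same as letting key default:

```python
import tensorflow as tf

tf.random.set_seed(0)
mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=4)

out1 = tf.random.uniform((1, 3, 8))        # decoder-side queries: 3 tokens
enc_output = tf.random.uniform((1, 6, 8))  # encoder output: 6 tokens, used for k and v

# Cross-attention: query = out1, value = key = enc_output.
attn2 = mha(out1, enc_output, enc_output)

# Same call with key omitted; Keras should then use key = value.
attn2_same = mha(out1, enc_output)

# Maximum elementwise difference between the two outputs.
print(tf.reduce_max(tf.abs(attn2 - attn2_same)).numpy())
```

The difference comes out as zero for me, which suggests k and v being the same tensor is the normal case, not an error. But I'd still like to understand why the architecture feeds one tensor into two conceptually different roles.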
What am I missing here? How does the need for q, k and v correspond to what we are actually feeding to these MultiHeadAttention layers?
Thanks for any enlightenment!
Julian