C4W2 assignment: query and key dimensions do not match in mha2 of the decoder

In the example demonstrating the DecoderLayer, encoder_test_output and q have different dimensions ((1, 7, 8) and (1, 15, 12), respectively), which does not respect the expectation of MultiHeadAttention that the query and key (last) dimensions must match. How is it still possible to compute MultiHeadAttention in this case without getting an error? How is it done? And how does it change the meaning and the theory explained in the course?


I guess encoder_test_output and q must be transformed internally (by some matrix multiplication) such that their transformed versions have the same last dimension.


Hi @victor_popa!

I think Figure 3a: Transformer Decoder layer and Figure 4: Transformer could offer some context to make it clearer where encoder_test_output and q come from, where they go, and which dimensions must match. Hint: there are two MultiHeadAttention blocks in the Decoder.

Do not hesitate to ask if you need more clarifications.

Best


Sorry, I missed the fact that you refer to mha2 in the title of the post, so I guess you are aware that there are two MultiHeadAttention blocks.

Regarding your second post, the answer is yes: transformations are performed internally for both mha1 and mha2. Check out the definitions of these layers.

E.g.,

self.mha1 = tf.keras.layers.MultiHeadAttention(
    num_heads=num_heads,
    key_dim=embedding_dim,
    dropout=dropout_rate
)

The key_dim parameter controls the size of each attention head for query and key.
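To illustrate, here is a minimal standalone sketch (not taken from the assignment; num_heads=2 and key_dim=4 are arbitrary illustrative values, assuming TensorFlow 2.x). It uses the same shapes as in your example and shows that the layer accepts a query and a key/value tensor with different last dimensions, because both are projected to key_dim per head before the dot-product attention:

import tensorflow as tf

# Illustrative hyperparameters, not the assignment's settings
mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=4)

q = tf.random.uniform((1, 15, 12))          # decoder query, last dim 12
enc_output = tf.random.uniform((1, 7, 8))   # encoder output, last dim 8, used as key and value

# Query and key are each projected internally to key_dim per head,
# so their original last dimensions do not need to match.
out = mha(query=q, value=enc_output, key=enc_output)
print(out.shape)  # (1, 15, 12): the output keeps the query's sequence length and last dimension

Note that the output is projected back to the query's feature dimension (since output_shape is left at its default), which is why mha2's output fits the rest of the decoder layer without any change to the theory from the course.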


Hi @Anna_Kay!

Thank you so much for the quick replies. With your second answer everything is clear now.

Thanks and best regards!
