C4W2 assignment: query and key dimensions do not match in mha2 of the decoder

In the example demonstrating the DecoderLayer, encoder_test_output and q have different dimensions ((1, 7, 8) and (1, 15, 12), respectively), which does not respect the expectation of MultiHeadAttention that the query and key (last) dimensions must match. How is it still possible to compute MultiHeadAttention in this case without getting an error? How is it done? And how does it change the meaning and the theory explained in the course?


I guess encoder_test_output and q must be transformed internally (by some matrix multiplication) such that their transformed versions have the same last dimension.


Hi @victor_popa!

I think Figure 3a: Transformer Decoder layer and Figure 4: Transformer could offer some context to make it clearer where encoder_test_output and q come from, where they go, and which dimensions must match. Hint: there are two MultiHeadAttention blocks in the Decoder.

Do not hesitate to ask if you need more clarifications.

Best


Sorry, I missed the fact that you refer to mha2 in the title of the post, so I guess you are aware that there are two MultiHeadAttention blocks.

Regarding your second post, the answer is yes: transformations are performed internally for both mha1 and mha2. Check out the definitions of these layers.

E.g.,

self.mha1 = tf.keras.layers.MultiHeadAttention(
    num_heads=num_heads,
    key_dim=embedding_dim,
    dropout=dropout_rate
)

The key_dim parameter controls the size of each attention head for query and key.
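To illustrate, here is a minimal standalone sketch (not taken from the assignment; num_heads=2 and key_dim=4 are arbitrary illustrative values, assuming TensorFlow 2.x). It uses the same shapes as in your example and shows that the layer accepts a query and a key/value tensor with different last dimensions, because both are projected to key_dim per head before the dot-product attention:

import tensorflow as tf

# Illustrative hyperparameters, not the assignment's settings
mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=4)

q = tf.random.uniform((1, 15, 12))          # decoder query, last dim 12
enc_output = tf.random.uniform((1, 7, 8))   # encoder output, last dim 8, used as key and value

# Query and key are each projected internally to key_dim per head,
# so their original last dimensions do not need to match.
out = mha(query=q, value=enc_output, key=enc_output)
print(out.shape)  # (1, 15, 12): the output keeps the query's sequence length and last dimension

Note that the output is projected back to the query's feature dimension (since output_shape is left at its default), which is why mha2's output fits the rest of the decoder layer without any change to the theory from the course.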


Hi @Anna_Kay!

Thank you so much for the quick replies. With your second answer everything is clear now.

Thanks and best regards!
