When we use multiple heads in the TensorFlow MultiHeadAttention layer we invoked in the assignment, I would expect the number of features in the output (‘context’) embedding to increase by a factor of num_heads due to concatenation. Why is that not happening? Is it because we are not explicitly specifying value_dim (the size of each attention head for value) or output_shape (from the TF docs: if not specified, projects back to the query feature dim, i.e. the query input’s last dimension)? Is that the trick that’s making it work here?
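For reference, here is a minimal standalone snippet (not the assignment code; the shapes, key_dim, and head counts are just ones I picked) that shows what I'm observing:

```python
import tensorflow as tf

# (batch, seq_len, embed_dim) inputs, dimensions chosen arbitrarily
query = tf.random.normal((2, 10, 64))
value = tf.random.normal((2, 12, 64))

mha_2_heads = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=32)
mha_8_heads = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=32)

# Both print (2, 10, 64) for me, i.e. the query feature dimension,
# not num_heads times larger as I would have expected from concatenation.
print(mha_2_heads(query, value).shape)
print(mha_8_heads(query, value).shape)
```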
If we specify value_dim, does that mean the context embedding will have value_dim * num_heads features?
If we don’t specify value_dim but do specify output_shape, does value_dim default to output_shape / num_heads?
If we specify neither value_dim nor output_shape, does the layer assume output_shape equal to the query input’s embedding size and then value_dim equal to output_shape / num_heads?
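To make the questions above concrete, this is the kind of comparison I have in mind (again just a sketch with made-up dimensions, not the assignment setup), covering the three configurations I'm asking about:

```python
import tensorflow as tf

query = tf.random.normal((2, 10, 64))
value = tf.random.normal((2, 12, 64))

# Case 1: value_dim set explicitly, output_shape left unset.
mha_vdim = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=32, value_dim=16)
print(mha_vdim(query, value).shape)

# Case 2: output_shape set explicitly, value_dim left unset.
mha_oshape = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=32, output_shape=128)
print(mha_oshape(query, value).shape)

# Case 3: neither set (the assignment's situation, as I understand it).
mha_default = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=32)
print(mha_default(query, value).shape)
```

I'd like to understand how the last dimension of each of these outputs is determined.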
Appreciate your help,
Manav.