C5W4 Assignment: Multi-head attention output dimension

When we use multiple heads, as in the TensorFlow MultiHeadAttention layer we invoked in the assignment, I would expect the number of features in the output (‘context’) embedding to grow by a factor of num_heads due to concatenation. Why does that not happen? Is it because we are not explicitly specifying value_dim (the size of each attention head for values) or output_shape (from the TF docs: if not specified, it projects back to the query feature dim, i.e. the query input’s last dimension)? Is that the trick that makes it work here?

If we specify value_dim, does that mean the context embedding will have value_dim * num_heads features?

If we don’t specify value_dim but do specify output_shape, does value_dim default to output_shape / num_heads?

If we specify neither value_dim nor output_shape, does it assume output_shape equals the query input’s embedding size, and then set value_dim to output_shape / num_heads?

Appreciate your help,


Please read section 3.2.2 of the “Attention Is All You Need” paper. key_dim and value_dim are the dimensions to which the queries/keys and the values are projected before scaled dot-product attention is performed across the multiple heads. If value_dim is unset, it defaults to key_dim.
See the final Linear layer at the top of the figure, which projects the concatenation of the individual attention outputs. output_shape is the projection dimension for that final output. I’ve never needed to touch it, since the default output shape matches the query feature dimension. If set, the MHA output will have shape (batch, query seq. len, output_shape).
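A quick sketch that checks this behavior directly in tf.keras.layers.MultiHeadAttention (the batch, sequence, and feature sizes below are arbitrary toy values): the output width never becomes value_dim * num_heads; it is the query feature dim by default, or output_shape when that is set.

```python
import tensorflow as tf

# Toy input: batch=2, seq_len=5, query feature dim=16 (arbitrary sizes).
x = tf.random.normal((2, 5, 16))

# Default: value_dim falls back to key_dim, and the final Linear layer
# projects the concatenated heads back to the query feature dim (16),
# so the output width does not grow with num_heads.
mha_default = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=8)
print(mha_default(query=x, value=x).shape)  # (2, 5, 16)

# Setting value_dim changes only the per-head value projection,
# not the output width.
mha_vdim = tf.keras.layers.MultiHeadAttention(
    num_heads=4, key_dim=8, value_dim=10
)
print(mha_vdim(query=x, value=x).shape)  # (2, 5, 16)

# Setting output_shape makes the final projection target that width instead.
mha_proj = tf.keras.layers.MultiHeadAttention(
    num_heads=4, key_dim=8, output_shape=32
)
print(mha_proj(query=x, value=x).shape)  # (2, 5, 32)
```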

Thank you, Balaji, for the clarification! This makes sense! I wish this were clearer in the TensorFlow documentation.

Best regards,