When we use multiple heads in the TensorFlow MultiHeadAttention layer we invoked in the assignment, I would expect the number of features in the output (‘context’) embedding to increase by a factor of num_heads due to concatenation. Why is that not happening? Is it because we are not explicitly specifying value_dim (the size of each attention head for value) or output_shape (from the TF docs: if not specified, projects back to the query feature dim, i.e. the query input’s last dimension)? Is that the trick that’s making it work here?
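For reference, here is a minimal standalone snippet (not the assignment code; the shapes, key_dim, and head counts are just ones I picked) that shows what I'm observing:

```python
import tensorflow as tf

# (batch, seq_len, embed_dim) inputs, dimensions chosen arbitrarily
query = tf.random.normal((2, 10, 64))
value = tf.random.normal((2, 12, 64))

mha_2_heads = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=32)
mha_8_heads = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=32)

# Both print (2, 10, 64) for me, i.e. the query feature dimension,
# not num_heads times larger as I would have expected from concatenation.
print(mha_2_heads(query, value).shape)
print(mha_8_heads(query, value).shape)
```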
If we specify value_dim, does that mean the context embedding will have value_dim * num_heads features?
If we don’t specify value_dim but do specify output_shape, does value_dim default to output_shape / num_heads?
If we specify neither value_dim nor output_shape, does the layer assume output_shape equal to the query input’s embedding size and then value_dim equal to output_shape / num_heads?
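To make the questions above concrete, this is the kind of comparison I have in mind (again just a sketch with made-up dimensions, not the assignment setup), covering the three configurations I'm asking about:

```python
import tensorflow as tf

query = tf.random.normal((2, 10, 64))
value = tf.random.normal((2, 12, 64))

# Case 1: value_dim set explicitly, output_shape left unset.
mha_vdim = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=32, value_dim=16)
print(mha_vdim(query, value).shape)

# Case 2: output_shape set explicitly, value_dim left unset.
mha_oshape = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=32, output_shape=128)
print(mha_oshape(query, value).shape)

# Case 3: neither set (the assignment's situation, as I understand it).
mha_default = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=32)
print(mha_default(query, value).shape)
```

I'd like to understand how the last dimension of each of these outputs is determined.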
Appreciate your help,
Manav.