Multiheaded Attention - Number of heads and Dim of heads

Hi @nmurugesh

Yes, this is true - this is how you calculate how large a dimension each head will operate on.

Not entirely true. Each head gets its own projection, or in other words, each head gets its own compressed/transformed version of the embeddings.

As in my previous point, I think you are misinterpreting the code. I recently answered a similar question which might help illustrate the point: even though the operation that produces Q, K, V is a single one, the underlying channels of information are separate. In other words, emb \cdot W_q is a single operation (for efficiency) which produces one output, but that output is later split up (isolated) across the different heads - see the sketch below.
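To make that concrete, here is a minimal sketch of the "one matmul, then split" idea. The names (d_feature, n_heads, etc.) are just illustrative and not tied to any particular library's internals:

```python
import torch
import torch.nn as nn

batch, seq_len, d_feature, n_heads = 2, 10, 512, 8
d_head = d_feature // n_heads  # 512 // 8 = 64

emb = torch.randn(batch, seq_len, d_feature)

# One projection W_q shared across heads: a single matmul for efficiency.
W_q = nn.Linear(d_feature, d_feature, bias=False)
q = W_q(emb)                                        # (batch, seq_len, d_feature)

# Split the single output into n_heads isolated channels of size d_head.
q_heads = q.view(batch, seq_len, n_heads, d_head).transpose(1, 2)
print(q_heads.shape)                                # torch.Size([2, 8, 10, 64])

# In the emb @ W_q view, output features 0..63 only ever feed head 0,
# 64..127 feed head 1, and so on - so each head effectively has its own
# projection of the embeddings, even though it was computed in one operation.
```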

I would doubt those results: if training runs long enough on a large enough dataset, multi-head attention should be superior to a single head. In other words, if your results are the same, there must be some underlying reason.

The head dimension is definitely d_feature / n_heads. To convince yourself you can check:

In this work we employ h = 8 parallel attention layers, or heads. For each of these we use d_k = d_v = d_model / h = 64. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.

  • num_heads – Number of parallel attention heads. Note that embed_dim will be split across num_heads (i.e. each head will have dimension embed_dim // num_heads).
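As a quick sanity check, here is a small sketch using PyTorch's nn.MultiheadAttention (one common implementation, assumed here for illustration; your framework may differ, but the arithmetic is the same):

```python
import torch.nn as nn

embed_dim, num_heads = 512, 8
mha = nn.MultiheadAttention(embed_dim=embed_dim, num_heads=num_heads)

# Same arithmetic as in the paper: d_k = d_v = d_model / h = 512 / 8 = 64.
print(embed_dim // num_heads)   # 64, the per-head dimension
print(mha.head_dim)             # 64, stored by the module in current PyTorch versions

# embed_dim must be divisible by num_heads, otherwise construction fails.
try:
    nn.MultiheadAttention(embed_dim=512, num_heads=7)
except AssertionError as e:
    print(e)                    # embed_dim must be divisible by num_heads
```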