Hi @nmurugesh
Yes, this is true - this is how you calculate the dimension each head will operate on.
Not entirely true. Each head gets its own projection, or in other words, each head gets its own compressed/transformed version of the embeddings.
As with my previous point, I think you are misinterpreting the code. I recently answered a similar question which might help illustrate the point: even though the operation is a single one (to get Q, K, V), the underlying channels of information are separate. In other words, emb \cdot W_q is a single operation (for efficiency) which produces one output, but later that output is split up (isolated) into the different heads.
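If it helps, here is a minimal numpy sketch of that idea (the shapes and variable names are just for illustration, not the actual assignment code):

```python
import numpy as np

n_seq, d_feature, n_heads = 10, 512, 8
d_head = d_feature // n_heads                 # 64

emb = np.random.randn(n_seq, d_feature)
W_q = np.random.randn(d_feature, d_feature)

Q = emb @ W_q                                 # one matmul for all heads: (10, 512)
Q_heads = Q.reshape(n_seq, n_heads, d_head)   # split the last dim: (10, 8, 64)
Q_heads = Q_heads.transpose(1, 0, 2)          # per-head view: (8, 10, 64)

# Head 0 only ever sees its own 64 channels of the single projection output
assert np.allclose(Q_heads[0], Q[:, :d_head])
```

So one matrix multiplication is shared for efficiency, but attention is still computed per head on its own slice of channels.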
I would doubt the results if the training is done long enough on a large enough dataset - multi-head attention should be superior to single-head. In other words, if your results are the same, there must be some underlying reason.
The head dimension is definitely d_feature / n_heads. To convince yourself you can (see also the sketch after this list):
- check the Attention Is All You Need paper, in particular:
In this work we employ h = 8 parallel attention layers, or heads. For each of these we use d_k = d_v = d_model / h = 64. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.
- or check other DL frameworks' implementations, like the PyTorch MultiheadAttention documentation, in particular:
num_heads – Number of parallel attention heads. Note that embed_dim will be split across num_heads (i.e. each head will have dimension embed_dim // num_heads).
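A quick sanity check against those PyTorch numbers (assuming a recent PyTorch version; in_proj_weight and head_dim are internal attributes, so treat this as illustration only):

```python
import torch.nn as nn

embed_dim, num_heads = 512, 8
mha = nn.MultiheadAttention(embed_dim=embed_dim, num_heads=num_heads)

# One combined projection for Q, K, V across all heads ...
print(mha.in_proj_weight.shape)   # torch.Size([1536, 512]) == (3 * embed_dim, embed_dim)
# ... while each head operates on embed_dim // num_heads dimensions, exactly d_model / h = 64
print(mha.head_dim)               # 64
```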