Hi @nmurugesh
Yes, this is true - this is how you calculate the dimension each head will operate on.
Not entirely true. Each head gets its own projection, or in other words, each head gets its own compressed/transformed version of the embeddings.
As with my previous point, I think you are misinterpreting the code. I recently answered a similar question which might help illustrate the point: even though the operation is a single one (to get Q, K, V), the underlying channels of information are separate. In other words, emb \cdot W_q is a single operation (for efficiency) which produces one output, but later that output is split up (isolated) into the different heads.
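If it helps, here is a minimal numpy sketch of that idea (the shapes and variable names are just for illustration, not the actual assignment code):

```python
import numpy as np

n_seq, d_feature, n_heads = 10, 512, 8
d_head = d_feature // n_heads                 # 64

emb = np.random.randn(n_seq, d_feature)
W_q = np.random.randn(d_feature, d_feature)

Q = emb @ W_q                                 # one matmul for all heads: (10, 512)
Q_heads = Q.reshape(n_seq, n_heads, d_head)   # split the last dim: (10, 8, 64)
Q_heads = Q_heads.transpose(1, 0, 2)          # per-head view: (8, 10, 64)

# Head 0 only ever sees its own 64 channels of the single projection output
assert np.allclose(Q_heads[0], Q[:, :d_head])
```

So one matrix multiplication is shared for efficiency, but attention is still computed per head on its own slice of channels.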
I would doubt the results if the training is done long enough on a large enough dataset - multi-head attention should be superior to single-head. In other words, if your results are the same, there must be some underlying reason.
The head dimension is definitely d_feature / n_heads. To convince yourself you can (see also the sketch after this list):
- check the Attention Is All You Need paper, in particular:
In this work we employ h = 8 parallel attention layers, or heads. For each of these we use d_k = d_v = d_model / h = 64. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.
- or check other DL frameworks' implementations, like the PyTorch MultiheadAttention documentation, in particular:
num_heads – Number of parallel attention heads. Note that embed_dim will be split across num_heads (i.e. each head will have dimension embed_dim // num_heads).
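A quick sanity check against those PyTorch numbers (assuming a recent PyTorch version; in_proj_weight and head_dim are internal attributes, so treat this as illustration only):

```python
import torch.nn as nn

embed_dim, num_heads = 512, 8
mha = nn.MultiheadAttention(embed_dim=embed_dim, num_heads=num_heads)

# One combined projection for Q, K, V across all heads ...
print(mha.in_proj_weight.shape)   # torch.Size([1536, 512]) == (3 * embed_dim, embed_dim)
# ... while each head operates on embed_dim // num_heads dimensions, exactly d_model / h = 64
print(mha.head_dim)               # 64
```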