Why should d_features (the embedding dimension of tokens) be divisible by the number of heads? Is this purely for GPU computational efficiency?
Hi Yeshwant_Dattatreya,
Yes. The divisibility requirement simply means the embedding can be split evenly across the heads, so each head works with d_model/h dimensions and the overall cost stays close to that of single-head attention.
The authors of the Attention Is All You Need paper write the following:
“In this work we employ h = 8 parallel attention layers, or heads. For each of these we use d_k = d_v = d_model/h = 64. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality”.
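To make the splitting concrete, here is a minimal PyTorch-style sketch (the names `split_heads`, `d_model`, and `n_heads` are my own, not from the paper) showing why the reshape into heads only works when d_model divides evenly by the number of heads:

```python
import torch

def split_heads(x, n_heads):
    """Reshape (batch, seq_len, d_model) -> (batch, n_heads, seq_len, d_head)."""
    batch, seq_len, d_model = x.shape
    # The even split is only possible when d_model is divisible by n_heads.
    assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
    d_head = d_model // n_heads  # e.g. 512 / 8 = 64, as in the paper
    return x.view(batch, seq_len, n_heads, d_head).transpose(1, 2)

x = torch.randn(2, 10, 512)   # batch of 2, sequence length 10, d_model = 512
heads = split_heads(x, n_heads=8)
print(heads.shape)            # torch.Size([2, 8, 10, 64])
```

Because each head only attends over a 64-dimensional slice rather than the full 512, the total work across all 8 heads is roughly the same as one head over the full dimensionality, which is the point the quoted passage is making.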