C4 W2 Assignment UNQ_C5 - why assert d_feature % n_heads == 0

Why should d_feature (the embedding dimension of the tokens) be divisible by the number of heads? Is this purely for GPU computational efficiency?

Hi Yeshwant_Dattatreya,

Yes.

The authors of the Attention Is All You Need paper write the following:

“In this work we employ h = 8 parallel attention layers, or heads. For each of these we use d_k = d_v = d_model/h = 64. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.”
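
To make the divisibility requirement concrete: multi-head attention splits the d_feature-dimensional embedding into n_heads equal slices of size d_feature / n_heads, so that quotient must be an integer. Here is a minimal NumPy sketch of that head-splitting reshape (the function name `split_heads` and the shapes are illustrative, not the assignment's actual Trax code):

```python
import numpy as np

def split_heads(x, n_heads):
    """Reshape (batch, seq_len, d_feature) -> (batch, n_heads, seq_len, d_head).

    This reshape only works when d_feature is evenly divisible by n_heads,
    which is exactly what the assignment's assert checks.
    """
    batch, seq_len, d_feature = x.shape
    assert d_feature % n_heads == 0, "d_feature must be divisible by n_heads"
    d_head = d_feature // n_heads
    # Split the feature axis into (n_heads, d_head), then move the heads
    # axis forward so each head attends over its own slice independently.
    x = x.reshape(batch, seq_len, n_heads, d_head)
    return x.transpose(0, 2, 1, 3)

x = np.random.randn(2, 10, 512)          # d_feature = 512
print(split_heads(x, 8).shape)           # (2, 8, 10, 64)
```

With d_feature = 512 and n_heads = 8, each head operates on a 64-dimensional slice, matching the d_k = d_v = d_model/h = 64 in the quote above. If d_feature were not divisible by n_heads, the reshape would fail because the slices could not all have the same integer size.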