Why should d_features (the embedding dimension of tokens) be divisible by the number of heads? Is this purely for GPU computational efficiency?
Hi Yeshwant_Dattatreya,
Yes. The divisibility requirement simply means the embedding can be split evenly across the heads, so each head works with d_model/h dimensions and the overall cost stays close to that of single-head attention.
The authors of the Attention Is All You Need paper write the following:
“In this work we employ h = 8 parallel attention layers, or heads. For each of these we use d_k = d_v = d_model/h = 64. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality”.
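To make the splitting concrete, here is a minimal PyTorch-style sketch (the names `split_heads`, `d_model`, and `n_heads` are my own, not from the paper) showing why the reshape into heads only works when d_model divides evenly by the number of heads:

```python
import torch

def split_heads(x, n_heads):
    """Reshape (batch, seq_len, d_model) -> (batch, n_heads, seq_len, d_head)."""
    batch, seq_len, d_model = x.shape
    # The even split is only possible when d_model is divisible by n_heads.
    assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
    d_head = d_model // n_heads  # e.g. 512 / 8 = 64, as in the paper
    return x.view(batch, seq_len, n_heads, d_head).transpose(1, 2)

x = torch.randn(2, 10, 512)   # batch of 2, sequence length 10, d_model = 512
heads = split_heads(x, n_heads=8)
print(heads.shape)            # torch.Size([2, 8, 10, 64])
```

Because each head only attends over a 64-dimensional slice rather than the full 512, the total work across all 8 heads is roughly the same as one head over the full dimensionality, which is the point the quoted passage is making.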