Multiheaded Attention - Number of heads and Dim of heads

Hey @nmurugesh and @arvyzukai,
I spent more than an hour going through your takes on this, since both of them seemed correct to me :joy: and, as it turns out, both of them are indeed correct. What Prof Andrew mentions in DLS C5 W4 is correct, and what is mentioned in NLP C4 W2 is also correct. It's just the notation that is causing the confusion. I will be referring to one DLS thread here.

I believe that PyTorch and TensorFlow use different notation for the same thing. I have borrowed the image below from the reference.

[image: the W^Q weight matrix from the referenced DLS thread, showing how it is split across the n_heads]

As you can see in W^Q, there is indeed a split of embedding_dim across the n_heads along the horizontal direction, but there is no split along the vertical direction. So, what @arvyzukai has been describing all along is the split along the horizontal direction, and what @nmurugesh has been describing all along is the absence of a split along the vertical direction.
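On the PyTorch side, a quick way to see the "no split along the vertical direction" is to inspect the packed projection weight of torch.nn.MultiheadAttention (a toy check, assuming a recent PyTorch version and that query, key and value share the same embed_dim):

```python
import torch

embed_dim, num_heads = 8, 2
mha = torch.nn.MultiheadAttention(embed_dim, num_heads)

# When query, key and value share embed_dim, PyTorch packs the Q, K and V
# projections into a single weight of shape (3 * embed_dim, embed_dim).
# Any per-head split happens along the first dimension only; the second
# (embedding) dimension is never split across heads.
print(mha.in_proj_weight.shape)   # torch.Size([24, 8])
```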

In DLS, Prof Andrew took the matrices for the different heads as W_i^Q, W_i^K, W_i^V, i.e., there were 3 * n_heads matrices, each of dimension (embedding_dim // n_heads, embedding_dim). In other words, there was no split, since the matrices were already taken in their per-head shapes.
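To make those shapes concrete, here is a minimal NumPy sketch of the DLS-style view (the names d_model, n_heads and W_q_heads are just illustrative, not taken from the course code):

```python
import numpy as np

d_model, n_heads = 8, 2           # embedding_dim and number of heads
d_head = d_model // n_heads       # dimension per head

# DLS-style view: one separate W_i^Q per head, each of shape (d_head, d_model);
# the same holds for W_i^K and W_i^V, giving 3 * n_heads matrices in total.
W_q_heads = [np.random.randn(d_head, d_model) for _ in range(n_heads)]

x = np.random.randn(d_model)              # a single token embedding
q_heads = [W_i @ x for W_i in W_q_heads]  # each per-head query has shape (d_head,)
print([q.shape for q in q_heads])         # [(4,), (4,)]
```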

But in NLP, we take only 3 matrices, each of dimension (embedding_dim, embedding_dim), and we consider a logical split along axis = 0, so that the first dimension is divided into n_heads chunks of size embedding_dim // n_heads each.
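And here is the corresponding NumPy sketch of the NLP-style view, with a single (embedding_dim, embedding_dim) matrix and only a logical split along axis = 0 (again, the names are illustrative):

```python
import numpy as np

d_model, n_heads = 8, 2
d_head = d_model // n_heads

# NLP-style view: one W^Q of shape (d_model, d_model); the per-head matrices
# are just consecutive blocks of d_head rows, i.e. a logical split along axis 0.
W_q = np.random.randn(d_model, d_model)
W_q_heads = np.split(W_q, n_heads, axis=0)   # n_heads blocks of shape (d_head, d_model)

x = np.random.randn(d_model)
q_heads = [W_i @ x for W_i in W_q_heads]     # each of shape (d_head,), same as before
print([q.shape for q in q_heads])            # [(4,), (4,)]
```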

In both cases, as you can see, after performing the concatenation we get back 3 matrices of dimension (embedding_dim, embedding_dim).
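Here is a toy check of that equivalence (not the course code; all names are made up):

```python
import numpy as np

d_model, n_heads = 8, 2
d_head = d_model // n_heads
rng = np.random.default_rng(0)

# Start from DLS-style per-head matrices ...
W_q_heads = [rng.standard_normal((d_head, d_model)) for _ in range(n_heads)]

# ... concatenating them along axis 0 gives back the single NLP-style matrix.
W_q = np.concatenate(W_q_heads, axis=0)
assert W_q.shape == (d_model, d_model)

# Projecting with the big matrix and slicing per head matches projecting
# with the individual W_i^Q matrices.
x = rng.standard_normal(d_model)
q_full = W_q @ x
for i, W_i in enumerate(W_q_heads):
    assert np.allclose(q_full[i * d_head:(i + 1) * d_head], W_i @ x)
print("both views give identical per-head projections")
```

Let me know what you guys think about this.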

Cheers,
Elemento