Multiheaded Attention - Number of heads and Dim of heads

Hey @nmurugesh and @arvyzukai,
I spent more than an hour going through your takes on this, since both of them seemed correct to me :joy: and, as it turns out, both of them are indeed correct. What Prof Andrew mentions in DLS C5 W4 is correct, and what is mentioned in NLP C4 W2 is also correct. It's just the notation that is causing the confusion. I will be referring to one DLS thread here.

I believe that PyTorch and TensorFlow use different notation for the same thing. I have borrowed the image below from the reference.

[image: the W^Q weight matrix from the referenced DLS thread, showing how it is split across the n_heads]

As you can see in W^Q, there is indeed a split of embedding_dim across the n_heads along the horizontal direction, but there is no split along the vertical direction. So, what @arvyzukai has been describing all along is the split along the horizontal direction, and what @nmurugesh has been describing all along is the absence of a split along the vertical direction.
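On the PyTorch side, a quick way to see the "no split along the vertical direction" is to inspect the packed projection weight of torch.nn.MultiheadAttention (a toy check, assuming a recent PyTorch version and that query, key and value share the same embed_dim):

```python
import torch

embed_dim, num_heads = 8, 2
mha = torch.nn.MultiheadAttention(embed_dim, num_heads)

# When query, key and value share embed_dim, PyTorch packs the Q, K and V
# projections into a single weight of shape (3 * embed_dim, embed_dim).
# Any per-head split happens along the first dimension only; the second
# (embedding) dimension is never split across heads.
print(mha.in_proj_weight.shape)   # torch.Size([24, 8])
```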

In DLS, Prof Andrew took the matrices for the different heads as W_i^Q, W_i^K, W_i^V, i.e., there were 3 * n_heads matrices, each of dimension (embedding_dim // n_heads, embedding_dim). In other words, there was no split, since the matrices were already taken in their per-head shapes.
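To make those shapes concrete, here is a minimal NumPy sketch of the DLS-style view (the names d_model, n_heads and W_q_heads are just illustrative, not taken from the course code):

```python
import numpy as np

d_model, n_heads = 8, 2           # embedding_dim and number of heads
d_head = d_model // n_heads       # dimension per head

# DLS-style view: one separate W_i^Q per head, each of shape (d_head, d_model);
# the same holds for W_i^K and W_i^V, giving 3 * n_heads matrices in total.
W_q_heads = [np.random.randn(d_head, d_model) for _ in range(n_heads)]

x = np.random.randn(d_model)              # a single token embedding
q_heads = [W_i @ x for W_i in W_q_heads]  # each per-head query has shape (d_head,)
print([q.shape for q in q_heads])         # [(4,), (4,)]
```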

But in NLP, we take only 3 matrices, each of dimension (embedding_dim, embedding_dim), and we consider a logical split along axis = 0, so that the first dimension is divided into n_heads chunks of size embedding_dim // n_heads each.
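And here is the corresponding NumPy sketch of the NLP-style view, with a single (embedding_dim, embedding_dim) matrix and only a logical split along axis = 0 (again, the names are illustrative):

```python
import numpy as np

d_model, n_heads = 8, 2
d_head = d_model // n_heads

# NLP-style view: one W^Q of shape (d_model, d_model); the per-head matrices
# are just consecutive blocks of d_head rows, i.e. a logical split along axis 0.
W_q = np.random.randn(d_model, d_model)
W_q_heads = np.split(W_q, n_heads, axis=0)   # n_heads blocks of shape (d_head, d_model)

x = np.random.randn(d_model)
q_heads = [W_i @ x for W_i in W_q_heads]     # each of shape (d_head,), same as before
print([q.shape for q in q_heads])            # [(4,), (4,)]
```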

In both cases, as you can see, after performing the concatenation we get back 3 matrices of dimension (embedding_dim, embedding_dim).
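Here is a toy check of that equivalence (not the course code; all names are made up):

```python
import numpy as np

d_model, n_heads = 8, 2
d_head = d_model // n_heads
rng = np.random.default_rng(0)

# Start from DLS-style per-head matrices ...
W_q_heads = [rng.standard_normal((d_head, d_model)) for _ in range(n_heads)]

# ... concatenating them along axis 0 gives back the single NLP-style matrix.
W_q = np.concatenate(W_q_heads, axis=0)
assert W_q.shape == (d_model, d_model)

# Projecting with the big matrix and slicing per head matches projecting
# with the individual W_i^Q matrices.
x = rng.standard_normal(d_model)
q_full = W_q @ x
for i, W_i in enumerate(W_q_heads):
    assert np.allclose(q_full[i * d_head:(i + 1) * d_head], W_i @ x)
print("both views give identical per-head projections")
```

Let me know what you guys think about this.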

Cheers,
Elemento