In the video on multi-head attention, there seems to be a typo in the calculation of q, v and k at 1:49.
The slide shows q2 = W1q * q2, while it should be q2 = W1q * x2 to be consistent with the previous video.
Could you confirm that it is indeed a typo in the slide?
I don’t see any calculation of q2 in that video around that time mark.
Can you post a screen capture image showing the issue?
My mistake. I confused the Wq, Wk and Wv matrices used to compute the q, k and v vectors (q=Wq.x, k=Wk.x and v=Wv.x) with the Wq1, Wk1 and Wv1 matrices used in the multi-head attention.
I suppose that you calculate q, k and v using the same matrices across all heads, and then multiply the result by a matrix specific to each head to differentiate them.
Thus for head 1: q1 = Wq1.Wq.x and for head 2: q2 = Wq2.Wq.x
Is my reasoning correct?
Hi @Smoulis, I have the same reasoning as you and would love to know whether it is correct.
I agree with @Smoulis. The alternative – Wi.qj – seems to imply a redundant matrix.
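For what it's worth, here is a minimal NumPy sketch of the formula discussed above, q_i = Wqi.Wq.x. It only checks the algebra: applying a shared Wq and then a head-specific Wqi gives the same vector as applying the single product matrix (Wqi.Wq) directly to x. The sizes, the random values and the variable names are all assumptions made up for illustration; none of them come from the course, and this does not settle which convention the slide intends.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
d_model, d_head, n_heads = 4, 2, 2

x = rng.normal(size=d_model)                             # one input embedding x
Wq = rng.normal(size=(d_model, d_model))                 # shared matrix: q = Wq.x
Wq_heads = rng.normal(size=(n_heads, d_head, d_model))   # per-head matrices Wq1, Wq2

# Two-step version: shared projection first, then one matrix per head.
q = Wq @ x
q_two_step = np.stack([Wq_heads[i] @ q for i in range(n_heads)])

# One-step version: fold each head matrix into the shared one, then apply to x.
W_combined = Wq_heads @ Wq        # shape (n_heads, d_head, d_model)
q_one_step = W_combined @ x       # shape (n_heads, d_head)

# Matrix multiplication is associative, so both versions agree.
print(np.allclose(q_two_step, q_one_step))
```

Since the two versions agree, each head's query is still a single linear function of x, whether it is written as one matrix or as a product of two.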