C5W4 Multi-head attention

Smoulis · October 6, 2021, 6:58am

In the video on multi-head attention, there seems to be a typo in the calculation of q, v and k at 1:49.
The slide shows q2 = W1q * q2, whereas it should be q2 = W1q * x2 to be consistent with the previous video.

Could you confirm that this is indeed a typo in the slide?

Regards

TMosh · October 6, 2021, 6:58pm

I don’t see any calculation of q2 in that video around that time mark.
Can you post a screen capture image showing the issue?

Smoulis · October 8, 2021, 8:48am

My mistake. I confused the Wq, Wk and Wv matrices used to compute the q, k and v vectors (q = Wq.x, k = Wk.x and v = Wv.x) with the Wq1, Wk1 and Wv1 matrices used in multi-head attention.

I suppose that you calculate q, k and v using the same matrices across all heads, and then multiply the result by a matrix specific to each head to differentiate them.
Thus for head 1: q1 = Wq1.Wq.x, and for head 2: q2 = Wq2.Wq.x

Is my reasoning correct?

Akingbeni_David · March 1, 2023, 9:28am

Hi @Smoulis, I have the same reasoning as you, and I would love to know whether you were correct.

Thanks.

Nikita_Ostrovsky · May 10, 2023, 2:05pm

I agree with @Smoulis. The alternative – Wi.qj – seems to imply a redundant matrix.
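
For anyone who wants to check the algebra numerically, here is a minimal NumPy sketch of the per-head formula discussed in this thread (q1 = Wq1.Wq.x). The sizes and random weights are made up purely for illustration, and the variable names (W_q, W_q1, W_q2, W_q1_folded) are my own, not taken from the course code. It does not settle what the slide intends. It only shows that applying a shared matrix followed by a head-specific matrix gives the same result as applying their single product matrix directly to x.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
d_model, d_head = 8, 4
x = rng.normal(size=d_model)                # one input embedding x

# Assumed weights: one shared projection and one projection per head.
W_q = rng.normal(size=(d_model, d_model))   # shared: q = Wq.x
W_q1 = rng.normal(size=(d_head, d_model))   # head 1
W_q2 = rng.normal(size=(d_head, d_model))   # head 2

# Two-step version from the thread: q_i = Wq_i.Wq.x
q = W_q @ x
q1 = W_q1 @ q
q2 = W_q2 @ q

# Matrix multiplication is associative, so the two matrices can be
# folded into a single per-head matrix applied directly to x.
W_q1_folded = W_q1 @ W_q
q1_direct = W_q1_folded @ x

same = np.allclose(q1, q1_direct)
print(same)
```

Since the two linear maps collapse into one, a model that learns both Wq and Wq1 can represent the same queries as a model that learns a single per-head matrix applied to x.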
