I have a problem understanding a certain part of Andrew's explanation, and I couldn't find an explanation online or via ChatGPT.
In the previous video he explains that:
q = Wq dot X
k = Wk dot X
v = Wv dot X
and in the multi-head attention video:
Qi = q dot WiQ
Ki = k dot WiK
Vi = v dot WiV
for i in HEAD
My problem with this explanation is: why do we multiply q, k, v by WiQ, WiK, WiV instead of directly multiplying X by WiQ, WiK, WiV?
Aren't q, k, v just X multiplied by a certain weight matrix W? Going by that logic:
Qi = (Wq dot X) dot WiQ
Can't we just "merge" Wq and WiQ into a single matrix?
I hope I managed to convey my difficulty.
Is this because Wq, Wk, Wv are shared across all tokens?
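(For anyone following the argument: the intuition that two chained linear maps collapse into one is correct in general. A minimal numpy sketch, with illustrative dimensions that are my own assumption, not from the videos:)

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 8, 4

X = rng.standard_normal(d_model)              # one token embedding
Wq = rng.standard_normal((d_model, d_model))  # first projection: q = Wq @ X
WiQ = rng.standard_normal((d_head, d_model))  # hypothetical per-head projection

# Two chained linear maps...
q = Wq @ X
Qi = WiQ @ q

# ...are equivalent to one merged matrix applied directly to X.
Qi_merged = (WiQ @ Wq) @ X

assert np.allclose(Qi, Qi_merged)
```

So if the formulas really meant two matrix multiplications in a row, merging would indeed be possible, which is exactly what the question is getting at.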
Please give the video title and time mark that your question refers to.
Explanation of how to compute q, k, v: Self-Attention, 4:49
Explanation of how to compute Qi, Ki, Vi: Multi-Head Attention, 01:05
Sorry for not including this in the original post.
Please go to the Multi-Head Attention video, at 7:10, for the pop-up that explains the notation. The “q, k, v” in the previous video are not the “q, k, v” in the next video. We are not really multiplying two matrices in a row.
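(To make the resolved notation concrete: each head projects X directly with its own WiQ, WiK, WiV; there is no chained multiplication. A minimal numpy sketch of that standard multi-head formulation, with made-up dimensions and an unscaled softmax written out by hand:)

```python
import numpy as np

rng = np.random.default_rng(1)
n_heads, d_model = 2, 8
d_head = d_model // n_heads

X = rng.standard_normal((5, d_model))  # 5 tokens, one embedding each

heads = []
for i in range(n_heads):
    # One independent projection per head, applied directly to X:
    WiQ = rng.standard_normal((d_model, d_head))
    WiK = rng.standard_normal((d_model, d_head))
    WiV = rng.standard_normal((d_model, d_head))
    Qi, Ki, Vi = X @ WiQ, X @ WiK, X @ WiV

    # Scaled dot-product attention for this head:
    scores = Qi @ Ki.T / np.sqrt(d_head)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    heads.append(weights @ Vi)

# Concatenate the head outputs back to model width:
out = np.concatenate(heads, axis=-1)
assert out.shape == (5, d_model)
```

Each head learns its own projection of the raw input, which is why there is no pair of matrices to merge.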
Btw, @Kami_Tzayig, it was a very nice catch.
First of all, thank you!
I have no idea how I missed this pop-up.
Now I think I understand all the parts. Thank you again!
That's fine. Missing that pop-up is not a problem. I am glad you raised this question.