hey everyone!

I have a problem understanding a certain part of Andrew's explanation, and I couldn't find an explanation online or via ChatGPT.

in the previous video he explains that:
q = Wq dot X
k = Wk dot X
v = Wv dot X

and in the multi-head attention video:

Qi = q dot WiQ
Ki = k dot WiK
Vi = v dot WiV
for i in HEAD

My problem with this explanation: why do we multiply q, k, v by WiQ, WiK, WiV instead of computing Qi, Ki, Vi directly from X with WiQ, WiK, WiV?

Aren't q, k, v each just X multiplied by some weight matrix W? Going by that logic:
Qi = X dot Wq dot WiQ

Can't we just "merge" Wq and WiQ into a single matrix?
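The merging intuition is mathematically sound: matrix multiplication is associative, so two stacked linear projections always collapse into one. Here is a minimal numpy check (the shapes are arbitrary, chosen just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 8, 4

X = rng.standard_normal((5, d_model))          # 5 tokens, d_model features each
Wq = rng.standard_normal((d_model, d_model))   # hypothetical "shared" projection
WiQ = rng.standard_normal((d_model, d_head))   # hypothetical per-head projection

two_step = (X @ Wq) @ WiQ      # project to q, then to the head
merged = X @ (Wq @ WiQ)        # one merged projection

print(np.allclose(two_step, merged))  # True: the two projections can be fused
```

So if two projections really were applied in a row, they could indeed be fused into one learned matrix; the resolution (below) is that the course notation doesn't actually apply two projections in a row.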

I hope I managed to convey my point of confusion.

Is this because Wq, Wk, Wv are shared across all tokens?

TMosh
January 1, 2024, 7:05pm
Please give the video title and time mark that references your question.

explanation on how to compute q,k,v : Self-Attention - 4:49

explanation on how to compute Qi, Ki, Vi: Multi-Head Attention - 01:05

sorry for not including this in the original post

Hello @Kami_Tzayig ,

Please go to the Multi-Head Attention video, at 7:10, for the pop-up that explains the notation. The “q, k, v” in the previous video are not the “q, k, v” in the next video. We are not really multiplying two matrices in a row.
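To make the notation concrete: under that reading, each head's projection acts directly on the token representations, so there is only one projection per head, not two stacked ones. A minimal numpy sketch of standard multi-head attention in that form (all dimensions and weight initializations here are illustrative assumptions, not the course's actual values):

```python
import numpy as np

def softmax(z, axis=-1):
    # numerically stable softmax along the given axis
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
n_tokens, d_model, n_heads = 5, 8, 2
d_head = d_model // n_heads

X = rng.standard_normal((n_tokens, d_model))  # token representations

heads = []
for i in range(n_heads):
    # one projection per head, applied directly to X (no intermediate q, k, v)
    WiQ = rng.standard_normal((d_model, d_head))
    WiK = rng.standard_normal((d_model, d_head))
    WiV = rng.standard_normal((d_model, d_head))
    Qi, Ki, Vi = X @ WiQ, X @ WiK, X @ WiV

    # scaled dot-product attention for this head
    scores = softmax(Qi @ Ki.T / np.sqrt(d_head))
    heads.append(scores @ Vi)

# concatenate the heads back to d_model width
out = np.concatenate(heads, axis=-1)
print(out.shape)  # (5, 8)
```

With a single projection per head, there is nothing left to merge, which is why the question dissolves once the notation is read this way.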

Cheers,
Raymond

Btw, @Kami_Tzayig, that was a very nice catch!

Cheers!

First of all, thank you!
I have no idea how I missed that pop-up.
Now I think I understand all the parts. Thank you again!

That’s fine. Missing that pop-up is not important. I am glad that you found this question.

Cheers,
Raymond