hey everyone!

i have a problem understanding a certain part of Andrew’s explanation, and i couldn’t find an answer online or with ChatGPT

in the previous video he explains that:
q = Wq dot X
k = Wk dot X
v = Wv dot X

and in the multi-head attention video:

Qi = q dot WiQ
Ki = k dot WiK
Vi = v dot WiV

my problem with this explanation is: why do we multiply q, k, v by WiQ, WiK, WiV instead of multiplying X by WiQ, WiK, WiV directly?

aren’t q,k,v a combination of x with a certain W weight? going by that logic:
Qi = X dot Wq dot WiQ

can’t we just “merge” Wq and WiQ into a single matrix?

i hope i managed to convey what i’m confused about

is this because Wq, Wk, Wv are shared between all tokens?

Please give the video title and time mark that references your question.

explanation of how to compute q, k, v: Self-Attention - 4:49

explanation of how to compute Qi, Ki, Vi: Multi-Head Attention - 01:05

sorry for not including this in the original post

Hello @Kami_Tzayig,

Please go to the Multi-Head Attention video, at 7:10, for the pop-up that explains the notation. The “q, k, v” in the previous video are not the “q, k, v” in the next video. We are not really multiplying two matrices in a row.
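As a side note on the “merge” idea from the question: if two weight matrices really were applied one after the other with nothing in between, they could always be collapsed into one, because matrix multiplication is associative. Here is a minimal NumPy sketch of that fact. The sizes, the random values, and the names `W_q`, `W_i_q`, `W_merged` are assumptions made up for illustration (they are not taken from the course), and the row-vector convention `X @ W` follows the `X dot Wq dot WiQ` line in the question.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for this illustration.
n_tokens, d_model, d_head = 4, 8, 2

X = rng.normal(size=(n_tokens, d_model))     # one row per token
W_q = rng.normal(size=(d_model, d_model))    # stand-in for "Wq"
W_i_q = rng.normal(size=(d_model, d_head))   # stand-in for "WiQ" of one head

# Two linear maps applied in a row, as the question reads the slides.
two_step = (X @ W_q) @ W_i_q

# The same result from a single pre-multiplied matrix.
W_merged = W_q @ W_i_q
one_step = X @ W_merged

# True when both results agree up to floating-point rounding.
merge_is_exact = np.allclose(two_step, one_step)
print(merge_is_exact)
```

So a second matrix stacked directly on top of the first would add nothing that a single learned matrix per head could not already represent, which fits with the notation pop-up: only one projection per head is applied.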

Cheers,
Raymond

Btw, @Kami_Tzayig, that was a very nice catch

Cheers!

first of all thank you!
i have no idea how i missed this pop-up
now i think i understand all the parts, thank you again

That’s fine. Missing that pop-up is not important. I am glad that you found this question.

Cheers,
Raymond