Hey everyone!
I have a problem understanding a certain part of Andrew's explanation, and I couldn't find an explanation online or through ChatGPT.
In the previous video he explains that:
q = Wq dot X
k = Wk dot X
v = Wv dot X
And in the Multi-Head Attention video:
Qi = q dot WiQ
Ki = k dot WiK
Vi = v dot WiV
for each head i
My problem with this explanation: why do we multiply q, k, v by WiQ, WiK, WiV instead of computing directly from X with WiQ, WiK, WiV?
Aren't q, k, v combinations of X with a certain weight matrix W? By that logic:
Qi = X dot Wq dot WiQ
Since matrix multiplication is associative, can't we just "merge" Wq and WiQ into a single matrix?
I hope I managed to convey my problem.
Is this because Wq, Wk, Wv are shared between all tokens?
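To make the "merge" point concrete, here is a small numpy check (sizes are hypothetical, just for illustration): by associativity, projecting X with Wq and then with a head matrix WiQ gives the same result as projecting X once with the merged matrix Wq @ WiQ.

```python
import numpy as np

# Hypothetical sizes: sequence length 4, model dim 8, per-head dim 2.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))     # token embeddings
Wq = rng.standard_normal((8, 8))    # shared query projection
WiQ = rng.standard_normal((8, 2))   # head-specific projection

# Matrix multiplication is associative, so the two projections
# could in principle be collapsed into one matrix Wq @ WiQ.
two_step = (X @ Wq) @ WiQ
merged = X @ (Wq @ WiQ)
print(np.allclose(two_step, merged))  # True
```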
TMosh
January 1, 2024, 7:05pm
Please give the video title and time mark that references your question.
Explanation of how to compute q, k, v: Self-Attention, 4:49
Explanation of how to compute Qi, Ki, Vi: Multi-Head Attention, 1:05
Sorry for not including this in the original post.
Hello @Kami_Tzayig ,
Please go to the Multi-Head Attention video, at 7:10, for the pop-up that explains the notation. The “q, k, v” in the previous video are not the “q, k, v” in the next video. We are not really multiplying two matrices in a row.
Cheers,
Raymond
Btw, @Kami_Tzayig, that was a very nice catch!
Cheers!
First of all, thank you!
I have no idea how I missed that pop-up.
Now I think I understand all the parts. Thank you again!
That’s fine, missing that pop-up is not a problem. I am glad that you raised this question.
Cheers,
Raymond