I have a problem understanding a certain part of Andrew's explanation, and I couldn't find an explanation online or via ChatGPT.
In the previous video he explains that:
q = Wq dot X
k = Wk dot X
v = Wv dot X
and in the multi-head attention video:
Qi = q dot WiQ
Ki = k dot WiK
Vi = v dot WiV
for i in HEAD
My problem with this explanation is: why do we multiply q, k, v by WiQ, WiK, WiV instead of directly multiplying X by WiQ, WiK, WiV?
Aren't q, k, v just X multiplied by a certain weight matrix W? Going by that logic:
Qi = (Wq dot X) dot WiQ
Can't we just "merge" Wq and WiQ into a single matrix?
I hope I managed to convey my difficulty.
Is this because Wq, Wk, Wv are shared across all tokens?
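(For anyone following the argument: the intuition that two chained linear maps collapse into one is correct in general. A minimal numpy sketch, with illustrative dimensions that are my own assumption, not from the videos:)

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 8, 4

X = rng.standard_normal(d_model)              # one token embedding
Wq = rng.standard_normal((d_model, d_model))  # first projection: q = Wq @ X
WiQ = rng.standard_normal((d_head, d_model))  # hypothetical per-head projection

# Two chained linear maps...
q = Wq @ X
Qi = WiQ @ q

# ...are equivalent to one merged matrix applied directly to X.
Qi_merged = (WiQ @ Wq) @ X

assert np.allclose(Qi, Qi_merged)
```

So if the formulas really meant two matrix multiplications in a row, merging would indeed be possible, which is exactly what the question is getting at.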
Please give the video title and time mark that your question refers to.
Explanation of how to compute q, k, v: Self-Attention, 4:49
Explanation of how to compute Qi, Ki, Vi: Multi-Head Attention, 01:05
Sorry for not including this in the original post.
Please go to the Multi-Head Attention video, at 7:10, for the pop-up that explains the notation. The “q, k, v” in the previous video are not the “q, k, v” in the next video. We are not really multiplying two matrices in a row.
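(To make the resolved notation concrete: each head projects X directly with its own WiQ, WiK, WiV; there is no chained multiplication. A minimal numpy sketch of that standard multi-head formulation, with made-up dimensions and an unscaled softmax written out by hand:)

```python
import numpy as np

rng = np.random.default_rng(1)
n_heads, d_model = 2, 8
d_head = d_model // n_heads

X = rng.standard_normal((5, d_model))  # 5 tokens, one embedding each

heads = []
for i in range(n_heads):
    # One independent projection per head, applied directly to X:
    WiQ = rng.standard_normal((d_model, d_head))
    WiK = rng.standard_normal((d_model, d_head))
    WiV = rng.standard_normal((d_model, d_head))
    Qi, Ki, Vi = X @ WiQ, X @ WiK, X @ WiV

    # Scaled dot-product attention for this head:
    scores = Qi @ Ki.T / np.sqrt(d_head)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    heads.append(weights @ Vi)

# Concatenate the head outputs back to model width:
out = np.concatenate(heads, axis=-1)
assert out.shape == (5, d_model)
```

Each head learns its own projection of the raw input, which is why there is no pair of matrices to merge.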
Btw, @Kami_Tzayig, it was a very nice catch.
First of all, thank you!
I have no idea how I missed this pop-up.
Now I think I understand all the parts. Thank you again!
That's fine. Missing that pop-up is not a problem. I am glad you raised this question.