Hello
The transformers assignment states at one point: "Remember that to compute self-attention Q, V and K should be the same." I don't understand why. The lectures do not mention anything about making Q, V, and K the same when computing self-attention.
Thanks for the clarification
Regards,
Boris M.
The lectures on this topic are rather incomplete. We’ve requested some updates, but they’re not available yet.
By definition, “self-attention” means you use exactly the same input data for Q, K, and V: the queries, keys, and values are all derived from the same sequence.
It’s not well-explained in the lectures.
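To make that concrete, here is a minimal NumPy sketch (illustrative only, not the assignment’s actual code; the function and weight names are made up for this example). The same input sequence x is used to produce Q, K, and V, each through its own projection, and then scaled dot-product attention is applied:

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Compute softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                 # (seq_len, seq_len) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v                              # weighted sum of the values

seq_len, d_model = 5, 8
x = np.random.randn(seq_len, d_model)   # one input sequence -- the "same data"

# Separate projections for Q, K, and V (random placeholders standing in for learned weights)
W_q = np.random.randn(d_model, d_model)
W_k = np.random.randn(d_model, d_model)
W_v = np.random.randn(d_model, d_model)

# Self-attention: Q, K, and V are all derived from the same input x
out = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
print(out.shape)  # (5, 8)
```

In other words, "Q, K, and V should be the same" refers to the data passed into the attention layer; in most implementations the layer still learns separate projection weights for each of them internally.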
Hi,
It’s been several months since this question was raised here, but I still have the same question (I’m not sure whether anything in the lectures was updated to cover it). So, why do Q, K, and V need to be the same for self-attention? How does it make sense to have three matrices but make them all the same (i.e., isn’t that just a waste of parameters)?
Thanks,
Tal
Relatedly, I would also like to understand why, in the decoder of the same assignment (week 4, assignment 1), k and v are the same (both are enc_output).
It would be great if someone could provide a general explanation of when q, k, and v are the same and when they are all distinct from each other. I have seen several questions about this in the forum, but no answers that actually clarify it…
Thanks!
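For what it’s worth, here is a rough, self-contained sketch of the two call patterns (again just illustrative; the attention function, dec_x, and the shapes are invented for this example, not taken from the assignment). In self-attention, q, k, and v all come from the same sequence; in the decoder’s encoder-decoder attention, the queries come from the decoder while the keys and values both come from enc_output, which is why k and v are the same there:

```python
import numpy as np

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V -- the same computation in both cases below."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

d_model = 8
dec_x = np.random.randn(4, d_model)        # decoder-side sequence (target tokens so far)
enc_output = np.random.randn(6, d_model)   # encoder output for the source sequence

# Decoder self-attention: q, k, and v all come from the decoder's own sequence
# (learned projections and the causal mask are omitted here for brevity).
self_att = attention(dec_x, dec_x, dec_x)

# Encoder-decoder ("cross") attention: queries come from the decoder,
# while keys and values both come from enc_output -- hence k == v == enc_output.
cross_att = attention(dec_x, enc_output, enc_output)

print(self_att.shape, cross_att.shape)     # (4, 8) (4, 8)
```

Roughly: whenever q, k, and v all come from one sequence it is self-attention; when the queries come from one place and the keys/values from another (as with enc_output in the decoder), it is encoder-decoder attention.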