Hi all,
Apologies for the formatting; it was the only way I could represent certain characters (the subscripts and superscripts) here.
I'm currently working on understanding how attention models work,
particularly how self-attention models are expanded into multi-head attention models.
So far, what I grasp is that q<i>, k<i>, and v<i>
are calculated in the following manner:
q<i> = W^Q . x<i>
(and analogously for k<i> and v<i>),
where W^Q is a learned parameter, i.e. a weight matrix applied to x<i>.
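To check that I'm following, here is a rough NumPy sketch of how I picture these projections. The sizes and variable names are just things I made up for illustration, not taken from any particular source:

import numpy as np

d_model, d_k = 8, 4                     # made-up sizes, just for the sketch
rng = np.random.default_rng(0)

# learned weight matrices (randomly initialised here only for illustration)
W_Q = rng.normal(size=(d_k, d_model))
W_K = rng.normal(size=(d_k, d_model))
W_V = rng.normal(size=(d_k, d_model))

x_i = rng.normal(size=(d_model,))       # the embedding of one input word

# q_i = W^Q . x_i, and the same pattern for k_i and v_i
q_i = W_Q @ x_i
k_i = W_K @ x_i
v_i = W_V @ x_i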
In the multi-head attention model, the outputs of the attention model
for a given word or input x<i> are written as:
W^Q q<i>, W^K k<i>, W^V v<i>
Does this mean that the learned weight matrices are then
re-multiplied by q<i>, k<i>, and v<i>, and if so,
why would this be done?
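To make my confusion concrete, continuing the sketch above, the way I'm currently reading that notation in code would be something like the lines below. W_Q_head is a second, per-head matrix that I have invented purely to express my reading; it may well not exist, which is exactly what I'm asking:

d_head = 2                                   # made-up per-head size
W_Q_head = rng.normal(size=(d_head, d_k))    # hypothetical second weight matrix (my assumption)
q_i_head = W_Q_head @ q_i                    # q_i, itself already W_Q @ x_i, multiplied by a weight matrix again?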
___
On a side (smaller) note: for the self-attention model,
what does v<i> intuitively represent?
The dot product q<i> . k<j> can be thought of as j's relevance
to describing i, but why does that relevance score then need to be multiplied
by another learned vector, v<j>?
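For reference, this is how I currently read the value vectors entering the output for position i, again just a rough NumPy sketch with names I made up (the rows of K and V stand for the k<j> and v<j> vectors):

import numpy as np

def attention_output(q_i, K, V):
    # scores q_i . k_j for every position j, turned into weights alpha_ij by a softmax
    scores = K @ q_i
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()
    # the output for position i is then the alpha-weighted sum of the value vectors v_j
    return alpha @ V

rng = np.random.default_rng(0)
n, d_k = 5, 4                      # made-up sizes
q_i = rng.normal(size=(d_k,))
K = rng.normal(size=(n, d_k))      # row j is k_j
V = rng.normal(size=(n, d_k))      # row j is v_j
out_i = attention_output(q_i, K, V)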
Thank you all,
David