Course 5 Week 4 - Transformer Networks mechanics

Hi all,

Apologies for the formatting; plain text was the only way I could write certain characters in the notation below.

I'm currently working through how attention models work, 
in particular how self-attention extends to multi-head attention.
 
So far, my understanding is that q<i>, k<i>, and v<i> 
are calculated in the following manner:
 
 q<i> = W^Q x<i> 
       where W^Q is a learned parameter (a weight matrix) applied to x<i>.
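 
If it helps anyone answer, here is how I picture that first step in NumPy. The sizes and random weights are just placeholders I made up, not anything from the course:

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, d_k = 8, 4   # made-up sizes, just for illustration

    x_i = rng.standard_normal(d_model)   # embedding of word i

    # W^Q, W^K, W^V are learned; random values stand in for them here
    W_Q = rng.standard_normal((d_k, d_model))
    W_K = rng.standard_normal((d_k, d_model))
    W_V = rng.standard_normal((d_k, d_model))

    q_i = W_Q @ x_i   # q<i> = W^Q x<i>
    k_i = W_K @ x_i   # k<i> = W^K x<i>
    v_i = W_V @ x_i   # v<i> = W^V x<i>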
 
In the multi-head attention model, the attention
for a given word or input x<i> is computed from:
 
 W^Q q<i>,  W^K k<i>,  W^V v<i>
 
Does this mean that the learned weights are then 
re-multiplied by q<i>, k<i>, and v<i>, and if so, 
why would this be done?
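 
To make the question concrete, here is a sketch of what I read that formula as saying, in the same NumPy style. Giving each head its own second set of matrices is my guess, and that guess is exactly what I'm unsure about:

    import numpy as np

    rng = np.random.default_rng(0)
    d_k, num_heads, d_head = 4, 2, 2   # placeholder sizes

    # pretend these were already computed as q<i> = W^Q x<i>, etc.
    q_i = rng.standard_normal(d_k)
    k_i = rng.standard_normal(d_k)
    v_i = rng.standard_normal(d_k)

    for h in range(num_heads):
        # my guess: each head has its own, separately learned matrices
        Wh_Q = rng.standard_normal((d_head, d_k))
        Wh_K = rng.standard_normal((d_head, d_k))
        Wh_V = rng.standard_normal((d_head, d_k))

        # the "re-multiplication" my question is about:
        q_h = Wh_Q @ q_i
        k_h = Wh_K @ k_i
        v_h = Wh_V @ v_i
        # each head would then run ordinary attention on (q_h, k_h, v_h)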
 
 ___
 
On a smaller side note: for the self-attention model,
what does v<i> intuitively represent?  
q<i> . k<j> can be thought of as word j's relevance 
to describing word i, but why would that relevance need 
to be multiplied by another learned value, v<j>?
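
For reference, here is the full self-attention step for one word as I currently understand it (placeholder sizes again); the softmaxed relevance scores end up weighting the value vectors:

    import numpy as np

    rng = np.random.default_rng(0)
    d_k, seq_len = 4, 3   # placeholder sizes

    q_i = rng.standard_normal(d_k)            # query for word i
    K = rng.standard_normal((seq_len, d_k))   # keys k<j> of every word j
    V = rng.standard_normal((seq_len, d_k))   # values v<j> of every word j

    scores = K @ q_i / np.sqrt(d_k)           # q<i> . k<j> for each j
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over j
    attn_i = weights @ V                      # weighted sum of the v<j>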

Thank you all,
David

Did you find an answer to your question?