Course 5 Week 4 - Transformer Networks mechanics

Hi all,

Apologies for the formatting; plain text was the only way I could write certain characters in the notation below.

I'm currently working through how attention models work, 
in particular how self-attention extends to multi-head attention.
 
So far, my understanding is that q<i>, k<i>, and v<i> 
are calculated in the following manner:
 
 q<i> = W^Q x<i> 
       where W^Q is a learned parameter (a weight matrix) applied to x<i>.
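 
If it helps anyone answer, here is how I picture that first step in NumPy. The sizes and random weights are just placeholders I made up, not anything from the course:

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, d_k = 8, 4   # made-up sizes, just for illustration

    x_i = rng.standard_normal(d_model)   # embedding of word i

    # W^Q, W^K, W^V are learned; random values stand in for them here
    W_Q = rng.standard_normal((d_k, d_model))
    W_K = rng.standard_normal((d_k, d_model))
    W_V = rng.standard_normal((d_k, d_model))

    q_i = W_Q @ x_i   # q<i> = W^Q x<i>
    k_i = W_K @ x_i   # k<i> = W^K x<i>
    v_i = W_V @ x_i   # v<i> = W^V x<i>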
 
In the multi-head attention model, the attention
for a given word or input x<i> is computed from:
 
 W^Q q<i>,  W^K k<i>,  W^V v<i>
 
Does this mean that the learned weights are then 
re-multiplied by q<i>, k<i>, and v<i>, and if so, 
why would this be done?
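 
To make the question concrete, here is a sketch of what I read that formula as saying, in the same NumPy style. Giving each head its own second set of matrices is my guess, and that guess is exactly what I'm unsure about:

    import numpy as np

    rng = np.random.default_rng(0)
    d_k, num_heads, d_head = 4, 2, 2   # placeholder sizes

    # pretend these were already computed as q<i> = W^Q x<i>, etc.
    q_i = rng.standard_normal(d_k)
    k_i = rng.standard_normal(d_k)
    v_i = rng.standard_normal(d_k)

    for h in range(num_heads):
        # my guess: each head has its own, separately learned matrices
        Wh_Q = rng.standard_normal((d_head, d_k))
        Wh_K = rng.standard_normal((d_head, d_k))
        Wh_V = rng.standard_normal((d_head, d_k))

        # the "re-multiplication" my question is about:
        q_h = Wh_Q @ q_i
        k_h = Wh_K @ k_i
        v_h = Wh_V @ v_i
        # each head would then run ordinary attention on (q_h, k_h, v_h)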
 
 ___
 
On a smaller side note: for the self-attention model,
what does v<i> intuitively represent?  
q<i> . k<j> can be thought of as word j's relevance 
to describing word i, but why would that relevance need 
to be multiplied by another learned value, v<j>?
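
For reference, here is the full self-attention step for one word as I currently understand it (placeholder sizes again); the softmaxed relevance scores end up weighting the value vectors:

    import numpy as np

    rng = np.random.default_rng(0)
    d_k, seq_len = 4, 3   # placeholder sizes

    q_i = rng.standard_normal(d_k)            # query for word i
    K = rng.standard_normal((seq_len, d_k))   # keys k<j> of every word j
    V = rng.standard_normal((seq_len, d_k))   # values v<j> of every word j

    scores = K @ q_i / np.sqrt(d_k)           # q<i> . k<j> for each j
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over j
    attn_i = weights @ V                      # weighted sum of the v<j>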

Thank you all,
David

Did you find an answer to your question?