The notation I am going to use is as follows:
- In the context of self-attention (first image below):
  - A superscript in angle brackets denotes the i-th word of the input sequence. For example, in q<i>, k<i>, v<i> and A<i>, the superscript <i> refers to the i-th word of the input sequence X = {x<1>, x<2>, x<3>, …, x<Tx>}.
  - In the self-attention video, q<i> = WQ . x<i>. Similarly, k<i> and v<i> were defined as WK . x<i> and WV . x<i>, respectively.
  - WQ, WK and WV are the same for all x<i> (see the sketch right after this list).
Can someone confirm?
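To state my understanding precisely, here is a minimal NumPy sketch of that single-head self-attention step. The dimensions Tx, d_x and d_k are made-up placeholders (not values from the lecture), and I use a rows-are-words convention, so X @ WQ stacks all the q<i> as rows:

```python
import numpy as np

# Hypothetical sizes, not from the lecture: sequence length, input dim, query/key dim.
Tx, d_x, d_k = 5, 8, 4

rng = np.random.default_rng(0)
X = rng.standard_normal((Tx, d_x))       # row i is the word embedding x<i>

# One shared set of weight matrices, reused for every position i.
WQ = rng.standard_normal((d_x, d_k))
WK = rng.standard_normal((d_x, d_k))
WV = rng.standard_normal((d_x, d_k))

# q<i> = WQ . x<i>, k<i> = WK . x<i>, v<i> = WV . x<i>, computed for all i at once.
Q = X @ WQ
K = X @ WK
V = X @ WV

# A<i> = sum over j of softmax(q<i>.k<j> / sqrt(d_k)) * v<j>
scores = Q @ K.T / np.sqrt(d_k)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
A = weights @ V                          # row i is A<i>

print(A.shape)                           # (Tx, d_k)
```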
Now, moving on to multi-head attention, I am going to use the subscript 'm' for each head. Translating this to q, k and v: if we have 64 heads, the Q matrix would contain the queries {q1, q2, q3, …, qm, …, q64}. Similarly, K = {k1, k2, k3, …, km, …, k64} and V = {v1, v2, v3, …, vm, …, v64}.
I understand that for each query qm we have a weight matrix WQm, i.e. for q1 we have WQ1, and so on.
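Just to pin down the shapes this notation implies, here is a small NumPy sketch. The values of n_heads, d_x and d_head are made-up placeholders, and multiplying x<1> in the last line is only one of the two readings I ask about below:

```python
import numpy as np

# Hypothetical sizes (not from the lecture): number of heads, input dim, per-head dim.
n_heads, d_x, d_head = 64, 512, 8

rng = np.random.default_rng(1)
# One weight matrix per head m = 1..64, stacked along the first axis:
# WQ_heads[m-1] plays the role of WQm (and likewise for WK_heads, WV_heads).
WQ_heads = rng.standard_normal((n_heads, d_head, d_x))
WK_heads = rng.standard_normal((n_heads, d_head, d_x))
WV_heads = rng.standard_normal((n_heads, d_head, d_x))

x1 = rng.standard_normal(d_x)            # x<1>, the first word

# Under this notation, word 1 ends up with 64 queries {q1<1>, ..., q64<1>}, one per head.
# Here they are computed from x<1> directly, but whether WQm should multiply x<1>
# or the already-projected q<1> is exactly what I ask below.
q_heads_for_word1 = WQ_heads @ x1        # shape (n_heads, d_head)
print(q_heads_for_word1.shape)
```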
But on the slide below, I am unable to understand WQ1 . q<1> (highlighted in blue circles below).
Does it mean we take the q<1> that was already computed as WQ . x<1> in the self-attention step and then multiply it by a new matrix WQ1? Or is it a typo, i.e. in the blue circles above, shouldn't it be x<1> instead of q<1>?
Later, when Andrew says that up to this step it is the normal self-attention you saw previously, it adds to the confusion, as the equations in the blue boxes don't line up with the equations highlighted in teal above.
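To make the question concrete, here are the two readings of the blue-circled term, written in the same NumPy sketch style (all dimensions and weights are random placeholders):

```python
import numpy as np

# Hypothetical dimensions for illustration only.
d_x, d_k, d_head = 8, 4, 4

rng = np.random.default_rng(2)
x1 = rng.standard_normal(d_x)                 # x<1>
WQ = rng.standard_normal((d_k, d_x))          # the shared WQ from the self-attention video
q1 = WQ @ x1                                  # q<1> = WQ . x<1>, as computed in self-attention

# Reading (a): the slide literally means WQ1 . q<1>, i.e. head 1 re-projects
# the query that was already computed with the shared WQ.
WQ1_a = rng.standard_normal((d_head, d_k))
head1_query_a = WQ1_a @ q1

# Reading (b): q<1> on the slide is really shorthand for (or a typo of) x<1>,
# so head 1 projects the raw input directly and WQ1 has shape (d_head, d_x).
WQ1_b = rng.standard_normal((d_head, d_x))
head1_query_b = WQ1_b @ x1

print(head1_query_a.shape, head1_query_b.shape)   # both (d_head,)
```

Which of these two matches what the slide intends?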