C5W4 Transformer multi-head weight matrices

The lectures present a set of linear transforms for self-attention:
q = Wq * x
k = Wk * x
v = Wv * x

and then a generalization for multi-head attention:
q’_1 = W^q_1 * q
k’_1 = W^k_1 * k
v’_1 = W^v_1 * v

I’m confused by this.
Is it then the case that q’_1 = W^q_1 * Wq * x, where there are two separate matrices whose weight parameters are both optimized during training?
But then this is an over-representation (because the product of the two unknown matrices collapses to a single unknown matrix). It seems better to instead define
q’_1 = W’^q_1 * x
with a single matrix for each attention head?
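
For what it's worth, here is a minimal NumPy sketch of that collapsing argument (all sizes are made up for illustration): applying W^q_1 to q = Wq · x gives exactly the same vector as applying the single product matrix W’^q_1 = W^q_1 · Wq to x directly.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_q, d_head = 8, 6, 4        # illustrative sizes only
x = rng.normal(size=d_model)          # one input word vector x

Wq  = rng.normal(size=(d_q, d_model))   # q = Wq @ x       (self-attention step)
Wq1 = rng.normal(size=(d_head, d_q))    # q'_1 = Wq1 @ q   (extra per-head step)

q_two_step  = Wq1 @ (Wq @ x)            # two learned matrices applied in sequence
W_collapsed = Wq1 @ Wq                  # ...which collapse into a single matrix
q_one_step  = W_collapsed @ x           # one matrix per head, applied directly to x

print(np.allclose(q_two_step, q_one_step))   # True
```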


Hi Hugues_Hoppe,

Only a single matrix is used to define q’_1 (i.e. a single matrix per query for each head).
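
To illustrate that claim, here is a minimal NumPy sketch (sizes and variable names are invented) in which each head m has its own single query/key/value matrix applied directly to the word vector x<i>:

```python
import numpy as np

rng = np.random.default_rng(1)

d_model, d_head, n_heads = 8, 4, 3       # illustrative sizes only
x = rng.normal(size=d_model)             # one word vector x<i>

# One query/key/value projection per head, each applied directly to x<i>.
Wq = [rng.normal(size=(d_head, d_model)) for _ in range(n_heads)]
Wk = [rng.normal(size=(d_head, d_model)) for _ in range(n_heads)]
Wv = [rng.normal(size=(d_head, d_model)) for _ in range(n_heads)]

q_heads = [Wq[m] @ x for m in range(n_heads)]   # q for head m: one matrix times x
k_heads = [Wk[m] @ x for m in range(n_heads)]
v_heads = [Wv[m] @ x for m in range(n_heads)]
```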

The notation I am going to use is as follows:

  • In context of self-attention (first image below):
    • A superscript in angle brackets refers to the i-th word in the input sequence. For example, in q<i>, k<i>, v<i> and A<i>, the superscript ‘i’ represents the i-th word of the input sequence X = {x1, x2, x3, …, xTx}.

    • In the self-attention video, q<i> = WQ · x<i>. Similarly, k<i> and v<i> were defined as WK · x<i> and WV · x<i>, respectively. WQ, WK and WV are the same for every x<i> (see the small sketch after this list).

    • Can someone confirm?
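
For the self-attention part, this is a minimal NumPy sketch of that reading (sizes are made up): one shared WQ, WK and WV applied to every word vector in the sequence.

```python
import numpy as np

rng = np.random.default_rng(2)

d_model, d_q, Tx = 8, 6, 5               # illustrative sizes only
X = rng.normal(size=(Tx, d_model))       # row i is the word vector x<i>

# A single WQ, WK, WV is shared across every position i in the sequence.
WQ = rng.normal(size=(d_q, d_model))
WK = rng.normal(size=(d_q, d_model))
WV = rng.normal(size=(d_q, d_model))

Q = X @ WQ.T    # row i is q<i> = WQ @ x<i>
K = X @ WK.T    # row i is k<i> = WK @ x<i>
V = X @ WV.T    # row i is v<i> = WV @ x<i>
```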

Now, moving on to multi-head attention, I am going to use the subscript ‘m’ for each head. Translating this to q, k, v: if we have 64 heads, the Q matrix would have the queries {q1, q2, q3, …, qm, …, q64}. Similarly, K = {k1, k2, k3, …, km, …, k64} and V = {v1, v2, v3, …, vm, …, v64}.

I understand that for each query qm we have a weight matrix WQm, i.e., for q1 we have WQ1, and so on.

But on the slide below, I am unable to understand WQ1 · q<1> (highlighted in the blue circles below).

Does it mean we take the q<1> already computed as WQ · x<1> in the self-attention step and then multiply it by a new matrix WQ1? Or is it a typo: in the blue circles above, shouldn’t q<1> instead be x<1>?

Later, when Andrew says that up to this step it is the normal self-attention you saw previously, it adds to the confusion, since the equations in the blue boxes don’t line up with the equations highlighted in teal above.


I second this question! Is the W.q phrasing a typo and should it be W.x? Or are we missing something about why it makes sense to multiply two weight matrices?

Please see this thread. Andrew’s intuition is sometimes incorrect from a math viewpoint. :disappointed_relieved: