C5W4 Transformer multi-head weight matrices

The lectures present a set of linear transforms for self-attention:
q = Wq * x
k = Wk * x
v = Wv * x

and then a generalization for multi-head attention:
q’_1 = W^q_1 * q
k’_1 = W^k_1 * k
v’_1 = W^v_1 * v

I’m confused by this.
Is it then the case that q’_1 = W^q_1 * Wq * x, where there are two separate matrices whose weight parameters are both optimized during training?
But then this is an over-representation (because the product of the two unknown matrices collapses to a single unknown matrix). It seems better to instead define
q’_1 = W’^q_1 * x
with a single matrix for each attention head?
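
For what it's worth, here is a minimal NumPy sketch of that collapsing argument (all sizes are made up for illustration): applying W^q_1 to q = Wq · x gives exactly the same vector as applying the single product matrix W’^q_1 = W^q_1 · Wq to x directly.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_q, d_head = 8, 6, 4        # illustrative sizes only
x = rng.normal(size=d_model)          # one input word vector x

Wq  = rng.normal(size=(d_q, d_model))   # q = Wq @ x       (self-attention step)
Wq1 = rng.normal(size=(d_head, d_q))    # q'_1 = Wq1 @ q   (extra per-head step)

q_two_step  = Wq1 @ (Wq @ x)            # two learned matrices applied in sequence
W_collapsed = Wq1 @ Wq                  # ...which collapse into a single matrix
q_one_step  = W_collapsed @ x           # one matrix per head, applied directly to x

print(np.allclose(q_two_step, q_one_step))   # True
```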


Hi Hugues_Hoppe,

Only a single matrix is used to define q’_1 (i.e. a single matrix per query for each head).
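
To illustrate that claim, here is a minimal NumPy sketch (sizes and variable names are invented) in which each head m has its own single query/key/value matrix applied directly to the word vector x<i>:

```python
import numpy as np

rng = np.random.default_rng(1)

d_model, d_head, n_heads = 8, 4, 3       # illustrative sizes only
x = rng.normal(size=d_model)             # one word vector x<i>

# One query/key/value projection per head, each applied directly to x<i>.
Wq = [rng.normal(size=(d_head, d_model)) for _ in range(n_heads)]
Wk = [rng.normal(size=(d_head, d_model)) for _ in range(n_heads)]
Wv = [rng.normal(size=(d_head, d_model)) for _ in range(n_heads)]

q_heads = [Wq[m] @ x for m in range(n_heads)]   # q for head m: one matrix times x
k_heads = [Wk[m] @ x for m in range(n_heads)]
v_heads = [Wv[m] @ x for m in range(n_heads)]
```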

The notation I am going to use is as follows:

  • In context of self-attention (first image below):
    • A superscript in angle brackets refers to the i-th word in the input sequence. For example, in q<i>, k<i>, v<i> and A<i>, the superscript ‘i’ represents the i-th word of the input sequence X = {x1, x2, x3, …, xTx}.

    • In the self-attention video, q<i> = WQ · x<i>. Similarly, k<i> and v<i> were defined as WK · x<i> and WV · x<i>, respectively. WQ, WK and WV are the same for every x<i> (see the small sketch after this list).

    • Can someone confirm?
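
For the self-attention part, this is a minimal NumPy sketch of that reading (sizes are made up): one shared WQ, WK and WV applied to every word vector in the sequence.

```python
import numpy as np

rng = np.random.default_rng(2)

d_model, d_q, Tx = 8, 6, 5               # illustrative sizes only
X = rng.normal(size=(Tx, d_model))       # row i is the word vector x<i>

# A single WQ, WK, WV is shared across every position i in the sequence.
WQ = rng.normal(size=(d_q, d_model))
WK = rng.normal(size=(d_q, d_model))
WV = rng.normal(size=(d_q, d_model))

Q = X @ WQ.T    # row i is q<i> = WQ @ x<i>
K = X @ WK.T    # row i is k<i> = WK @ x<i>
V = X @ WV.T    # row i is v<i> = WV @ x<i>
```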

Now, moving on to multi-head attention, I am going to use the subscript ‘m’ for each head. Translating this to q, k, v: if we have 64 heads, the Q matrix would have the queries {q1, q2, q3, …, qm, …, q64}. Similarly, K = {k1, k2, k3, …, km, …, k64} and V = {v1, v2, v3, …, vm, …, v64}.

I understand that for each query qm we have a weight matrix WQm, i.e., for q1 we have WQ1, and so on.

But on the slide below, I am unable to understand WQ1 · q<1> (highlighted in the blue circles below).

Does it mean we take the q<1> already computed as WQ · x<1> in the self-attention step and then multiply it by a new matrix WQ1? Or is it a typo: in the blue circles above, shouldn’t q<1> instead be x<1>?

Later, when Andrew says that up to this step it is the normal self-attention you saw previously, it adds to the confusion, since the equations in the blue boxes don’t line up with the equations highlighted in teal above.


I second this question! Is the W.q phrasing a typo and should it be W.x? Or are we missing something about why it makes sense to multiply two weight matrices?

Please see this thread. Andrew’s intuition is sometimes incorrect from a math viewpoint. :disappointed_relieved: