W4 A1 | Is there a typo in Multi-head attention slides?

anon57530071 · June 9, 2022, 7:38am

OK, I think you have already caught some of the key points. Let’s go through them in reverse order.
As the notation is slightly complex, I tried to draw a whole picture of the Transformer Encoder, using the same parameter names as the ones we learn in the Jupyter notebook. (Note that I used the original paper’s definitions for Q/K/V and their weights, so the order of Q and W is different from Andrew’s chart. But it is just a matter of transpose, so do not worry about that part.)

You can see the multi-head attention function in the center of this figure. Since it may be a little small, I will later paste another image that focuses only on multi-head attention.

In the multi-head attention layer, there are 3 steps, as below.

1. Linear operation (dot product of inputs and weights), then dispatch the queries, keys and values to the appropriate “heads”.
2. (In each head) Scaled dot-product attention to calculate attention scores (with Softmax). This is a parallel operation: the work is distributed to multiple heads, which work separately. (A big difference from an RNN.)
3. Concatenate the outputs from all heads, and calculate the “dot product” of this concatenated output and W_0, which is another weight matrix for the concatenated output.
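The three steps above can be sketched in NumPy roughly as follows. This is only a minimal illustration, not the notebook’s implementation: the function and variable names (`multi_head_attention`, `split_heads`, `W_q`, `W_o`, and so on) and the shapes are my own assumptions, the input is a single unbatched sequence, and masking is left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o, num_heads):
    # Hypothetical helper: Q, K, V have shape (seq_len, d_model),
    # and every weight matrix has shape (d_model, d_model).
    seq_len, d_model = Q.shape
    depth = d_model // num_heads

    def split_heads(x):
        # (seq_len, d_model) -> (num_heads, seq_len, depth)
        return x.reshape(x.shape[0], num_heads, depth).transpose(1, 0, 2)

    # Step 1: linear operation, then dispatch to the heads.
    q = split_heads(Q @ W_q)
    k = split_heads(K @ W_k)
    v = split_heads(V @ W_v)

    # Step 2: scaled dot-product attention, computed for all heads at once.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(depth)
    weights = softmax(scores, axis=-1)   # (num_heads, seq_len, seq_len)
    heads = weights @ v                  # (num_heads, seq_len, depth)

    # Step 3: concatenate the heads and apply the output weight.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o, weights

# Self-attention: the same X is passed as Q, K and V.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))              # 5 tokens, d_model = 8
W_q, W_k, W_v, W_o = (rng.normal(size=(8, 8)) for _ in range(4))
out, weights = multi_head_attention(X, X, X, W_q, W_k, W_v, W_o, num_heads=2)
print(out.shape, weights.shape)
```

The output has the same shape as X, which is why it can be fed into the following layers, and each row of the attention weights sums to 1 because of the Softmax.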

Then, after going through a fully connected layer, we get the updated X. This X then goes into the encoder layer (multi-head attention layer) again.
The key point is that, for “self-attention”, X is used for Q, K and V. Yes, the inputs are the same. In this sense, q^{<1>} is the same as the first word vector (+ positional encoding) in X.

Then, we separate Q\cdot W^Q into small queries. (The same applies to keys and values.)

So, the weights W_1^Q, W_2^Q, .. are not applied to q^{<1>}, q^{<2>}, .. yet. That is an operation inside “multi-head attention”.
In this sense, Andrew’s chart for multi-head attention is correct. (Of course, assuming that my chart is correct…)
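As a small numerical check of the splitting described above: if we assume that each per-head weight W_i^Q is simply a block of columns of one large W^Q (an assumption for this sketch, with made-up sizes and names), then splitting Q\cdot W^Q into small queries gives the same result as applying each W_i^Q separately.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 6, 3
depth = d_model // num_heads

X = rng.normal(size=(seq_len, d_model))    # used as Q in self-attention
W_Q = rng.normal(size=(d_model, d_model))  # one large query weight

# Option A: one big projection, then split the columns into heads.
QW = X @ W_Q
split = [QW[:, i * depth:(i + 1) * depth] for i in range(num_heads)]

# Option B: treat each column block of W_Q as a per-head weight W_i^Q.
per_head = [X @ W_Q[:, i * depth:(i + 1) * depth] for i in range(num_heads)]

# Both options produce identical per-head queries.
same = all(np.allclose(a, b) for a, b in zip(split, per_head))
print(same)
```

So whether the chart shows one W^Q followed by a split, or separate W_1^Q, W_2^Q, .., the computation is the same under this assumption.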

Then, the next discussion is about self-attention. Apparently, q^{<3>}, k^{<3>}, .. are “weighted” there. In this sense, as you point out, this may look inconsistent with “multi-head attention”.

My interpretation is that this is part of the “Self-Attention Intuition”, which explains how queries, keys and values work together. (It excludes the weights, which need another discussion.)

In short, I think you understand it correctly, and I also understand your points. Please consider the chart and explanation for “self-attention” as being for intuition.
