OK, I think you have already caught some of the key points. Let’s go through them in reverse order.

As the notation is slightly complex, I tried to draw a whole picture of the Transformer Encoder, using the same parameter names as the ones we learn in the Jupyter notebook. (Note that I used the definitions from the original paper for Q/K/V and their weights, so the order of Q and W is different from Andrew’s chart. But it is just a matter of a transpose. Do not worry about that part.)

You can see the multi-head attention function in the center of this figure. As it may be slightly small, I will paste another image later that focuses only on the multi-head attention.

In the multi-head attention layer, there are 3 steps, as below.

- Linear operation (dot product of inputs and weights), and dispatch queries, keys and values to appropriate "head"s.
- (In each head) scaled dot product attention to calculate attention scores (with Softmax). This is a parallel operation to distribute tasks to multiple heads to work separately. (A big difference from RNN.)
- Concatenate outputs from all heads, and calculate the “dot product” of this concatenated output and W_0 which is another weight for concatenated output.
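
To make the three steps above concrete, here is a minimal NumPy sketch. It is only an illustration under my own assumptions, not the notebook's implementation: the function name, the toy sizes and the random weights are all made up, there is no mask, bias or batch dimension, and `W_O` plays the role of W_0 above.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_Q, W_K, W_V, W_O, num_heads):
    # Hypothetical helper: X is (seq_len, d_model), all weights are (d_model, d_model).
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    # Step 1: linear operation. For self-attention, X is used for Q, K and V.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V

    def split_heads(M):
        # Dispatch to heads: (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = split_heads(Q), split_heads(K), split_heads(V)

    # Step 2: scaled dot-product attention, computed for all heads at once.
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)  # (num_heads, seq_len, seq_len)
    weights = softmax(scores, axis=-1)
    heads = weights @ Vh                                    # (num_heads, seq_len, d_head)

    # Step 3: concatenate the heads and multiply by the output weight.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_O, weights

# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
X = rng.normal(size=(seq_len, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, attn = multi_head_self_attention(X, W_Q, W_K, W_V, W_O, num_heads)
print(out.shape)   # same shape as X
print(attn.shape)  # one (seq_len, seq_len) attention map per head
```

Note that the output has the same shape as X, which is why it can be passed on and fed into an encoder layer again.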

Then, after going through a fully connected layer, we get an updated X. This then goes into the encoder layer (multi-head attention layer) again.

The key point is that, for “self-attention”, X is used for Q, K and V. Yes, the inputs are the same. In this sense, q^{<1>} is the same as the first word vector (+ positional encoding) in X.

Then, we separate Q\cdot W^Q into small queries (and do the same for keys and values).

So, the weights W_1^Q, W_2^Q, .. are not applied to q^{<1>}, q^{<2>}, .. yet. That is an operation inside “multi-head attention”.
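
If it helps, here is a small NumPy check of this point. It is only a sketch under one assumption of mine: that the per-head weights W_1^Q, W_2^Q, .. are the column blocks of one big W^Q. Under that assumption, splitting Q\cdot W^Q into small queries gives the same result as applying each W_i^Q to the unweighted Q (= X) separately. The sizes and random values are made up.

```python
import numpy as np

# Toy sizes and random values, chosen only for illustration.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 3, 6, 2

X = rng.normal(size=(seq_len, d_model))    # word vectors (+ positional encoding)
W_Q = rng.normal(size=(d_model, d_model))  # one big query weight

# Self-attention: Q is just X, so q^{<1>} is the first row of X, not weighted yet.
Q = X

# View 1: project once, then separate the result into small queries per head.
small_queries = np.split(Q @ W_Q, num_heads, axis=1)

# View 2 (assumption): W_i^Q is the i-th column block of W_Q, applied head by head.
W_Q_heads = np.split(W_Q, num_heads, axis=1)
per_head_queries = [Q @ W_i for W_i in W_Q_heads]

same = all(np.allclose(a, b) for a, b in zip(small_queries, per_head_queries))
print(same)
```

So the two views differ only in where the weights are drawn in a chart, not in what is computed.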

In this sense, Andrew’s chart for multi-head attention is correct. (Of course, assuming that my chart is correct…)

Then, the next discussion is about self-attention. Apparently, q^{<3>}, k^{<3>}, .. are already “weighted” there. In this sense, as you point out, this may not be consistent with “multi-head attention”.

My interpretation is that this is part of “Self-Attention Intuition”, which explains how queries, keys and values work together (excluding the weights, which need another discussion).

In short, I think you understand correctly, and I also understand your points. Please consider the chart and explanation for “self-attention” as being for intuition.