I have some confusion regarding the MultiHeadAttention layer. When we call an MHA layer with multiple heads and query, key, and value all equal to X, using the lecture's notation, is this equivalent to something of this form (ignoring the mask and batches):
\text{head}_i = \text{Attention}(W_i^Q X, W_i^K X, W_i^V X) = \text{softmax}\left(\frac{W_i^Q X (W_i^K X)^T}{\sqrt{d_k}}\right) W_i^V X,
where the weights W_i vary depending on the head?
From the paper: \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^O, where \text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V).
Now, with a little bit of ingenuity, for the query, we can construct
\begin{bmatrix} Q W_1^Q & Q W_2^Q & \cdots & Q W_h^Q \end{bmatrix} = Q \begin{bmatrix} W_1^Q & W_2^Q & \cdots & W_h^Q \end{bmatrix} = Q W^Q
where Q \in \mathbb{R}^{m \times \text{seq_len_q} \times d_{\text{model}}} and W_i^Q \in \mathbb{R}^{d_\text{model} \times d_k} with d_k = d_{\text{model}} / h according to the paper.
Consequently, we see that W^Q must have the shape (d_model, d_k * h) = (d_model, d_model / h * h) = (d_model, d_model).
If we split the output of the linear transformation into h heads, it has the same effect as calculating the h heads independently and then proceeding. However, computation-wise it is more efficient to use a single dense layer and split into h heads after the linear transformation, instead of splitting first and doing h separate linear transformations afterward.
Q W^Q has shape (m, seq_len_q, d_model) * (d_model, d_model) = (m, seq_len_q, d_model), i.e., the shape of Q again.
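To make the equivalence concrete, here is a minimal sketch (all shapes and variable names are illustrative, not the assignment's): one (d_model, d_model) projection whose output is split into h heads matches h independent (d_model, d_k) projections using the corresponding column slices of the big weight matrix.

```python
import tensorflow as tf

# Illustrative shapes only.
batch, seq_len, d_model, h = 2, 5, 8, 4
d_k = d_model // h

X = tf.random.normal((batch, seq_len, d_model))
W_Q = tf.random.normal((d_model, d_model))  # = [W_1^Q  W_2^Q ... W_h^Q] concatenated column-wise

# One dense projection, then split the last dimension into h heads of size d_k.
projected = tf.matmul(X, W_Q)                                  # (batch, seq_len, d_model)
split_heads = tf.reshape(projected, (batch, seq_len, h, d_k))  # (batch, seq_len, h, d_k)

# h independent projections with the corresponding column slices W_i^Q.
per_head = tf.stack(
    [tf.matmul(X, W_Q[:, i * d_k:(i + 1) * d_k]) for i in range(h)],
    axis=2,
)                                                              # (batch, seq_len, h, d_k)

# Identical up to floating-point error.
print(tf.reduce_max(tf.abs(split_heads - per_head)).numpy())
```

In an actual multi-head attention implementation you would additionally transpose to (batch, h, seq_len, d_k) before calling the attention function, but that does not change the equivalence.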
So using the paper’s notation instead, head_i = softmax(\frac{QW_i^Q(KW_i^K)^T}{\sqrt{d_k}})VW_i^V, and if Q = K = V = X, the shape of X should be (batch_size, timesteps, features)?
Regarding d_k: in the assignment, the key had shape (..., seq_len_k, depth). Should d_k = depth?
Yes
d_k is your last dimension. In this case it is features / heads.
If you look at scaled_dot_product_attention, it is not multi-head attention. So in this case maybe it is easier to think that you only operate on one head. In that case the function is not passed (batch_size, timesteps, features) but (batch_size, timesteps, d_k), as you noticed.
So if a tensor has shape (dim_1, dim_2, dim_3), in the case of softmax(\frac{QK^T}{\sqrt{d_k}})V, d_k = dim_3 of K, and in the case of softmax(\frac{XW_i^Q(XW_i^K)^T}{\sqrt{d_k}})XW_i^V, d_k = dim_3 of X?
@LuBinLiu, the above statement is actually wrong; I have had more time to look at this issue today.
The code works for both multi-head attention and when you don’t supply multiple heads. Matrix multiplication is performed on the last 2 dimensions only. The total number of dims can be 3 or 4. The function works the same and is happy to process your inputs. I have created my own version below in which you can see the shapes explicitly:
solution removed
If you don’t use tf.matmul(..., transpose_b=True) but tf.matmul(Q, tf.transpose(K, perm=(0, 1, 3, 2))) or tf.matmul(Q, tf.transpose(K, perm=(0, 2, 1))), then you actually fix it to work only for multi-head attention in the first case and only for a single head in the second case. The key piece of code that makes the function generalize is matmul with transpose_b=True.
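To illustrate that point, here is a minimal generic sketch (my own names and shapes, not the removed code above): tf.matmul only touches the last two dimensions, so the same function accepts 3-D (single-head) and 4-D (multi-head) inputs unchanged.

```python
import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask=None):
    """Generic sketch: works for (..., seq_len, d_k) with any number of leading dims."""
    # (..., seq_len_q, d_k) x (..., seq_len_k, d_k)^T -> (..., seq_len_q, seq_len_k)
    scores = tf.matmul(q, k, transpose_b=True)
    d_k = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled = scores / tf.math.sqrt(d_k)
    if mask is not None:
        scaled += mask * -1e9                    # masked positions -> ~0 after softmax
    weights = tf.nn.softmax(scaled, axis=-1)     # (..., seq_len_q, seq_len_k)
    return tf.matmul(weights, v)                 # (..., seq_len_q, d_v)

x3 = tf.random.normal((2, 5, 16))        # (batch, timesteps, d_k): single head
x4 = tf.random.normal((2, 4, 5, 16))     # (batch, heads, timesteps, d_k): multi-head
print(scaled_dot_product_attention(x3, x3, x3).shape)   # (2, 5, 16)
print(scaled_dot_product_attention(x4, x4, x4).shape)   # (2, 4, 5, 16)
```

A hard-coded perm in tf.transpose pins the function to one specific rank, which is exactly what breaks the single-head / multi-head generality.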
Yes. Your X acts as Q, K, and V, and the last dimension is the embedding-size dimension. Depending on how you use the function, the last dimension will be d_model or d_k. d_v could be different, but it is not used as a separate value in the Transformer paper (there d_v = d_k).
Is there a reason that in the case of Attention(XW_i^Q, XW_i^K, XW_i^V), d_k is the last dimension of X as opposed to the last dimension of the product XW_i^K?
The last dimension of X in your formula is d_model. The weights have shape (d_model, d_k), so you end up with a slice of the embedding dimension, which is d_k. The authors of the paper illustrate that you use different heads, and the weight matrices help you slice out a head.
So if X has shape (..., d_{model}) and XW_i^K has shape (..., d_k), why not scale XW_i^Q(XW_i^K)^T inside the softmax of Attention(...) by \frac{1}{\sqrt{d_k}} as opposed to \frac{1}{\sqrt{d_{model}}}?
In that case you should scale by 1/\sqrt{d_k}. Where do you see d_model? In the code, I reference the last dimension of K, which probably is d_k. But if you run attention without the linear transformation, i.e. on X and not on XW_i^K, then you actually end up with d_k = d_model in the code.
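A quick sketch of that last point (illustrative shapes only): the scale is always the square root of the last dimension of whatever you pass in as K, so skipping the projection simply makes that dimension d_model.

```python
import tensorflow as tf

# Illustrative shapes only.
d_model, h = 8, 4
d_k = d_model // h
X = tf.random.normal((2, 5, d_model))
W_K = tf.random.normal((d_model, d_k))

# Scale used when attending over X directly (no projection): sqrt(d_model).
scale_without_projection = tf.math.sqrt(tf.cast(tf.shape(X)[-1], tf.float32))
# Scale used when attending over X W_i^K: sqrt(d_k).
scale_with_projection = tf.math.sqrt(tf.cast(tf.shape(tf.matmul(X, W_K))[-1], tf.float32))

print(scale_without_projection.numpy())  # sqrt(8)
print(scale_with_projection.numpy())     # sqrt(2)
```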