Question about multi-head attention

In the self-attention lecture we compute q, k, and v to get the attention A, and that part I understand. But in the multi-head attention lecture, Prof. Ng multiplies q, k, and v by three new matrices before computing attention. Is that a mistake? If not, why is it done that way?

I also read the Transformer article on jagblog, which computes attention directly from Q, K, and V without first multiplying them by new matrices.

Please forgive my poor English.

It’s because MultiHeadAttention is more versatile than plain self-attention. In self-attention, Q, K, and V come from the same input and have the same dimension, but in MultiHeadAttention they may not. For example, the 2nd MultiHeadAttention block in the Transformer decoder takes Q from the output of the first MultiHeadAttention block, while K and V come from the encoder output; the decoder and encoder representations may have different dimensions. The linear transformations project all three to the same depth (d_model) so the attention computation works. It looks like the snippet below (copied from the tutorial):

import tensorflow as tf

class MultiHeadAttention(tf.keras.layers.Layer):
  def __init__(self, d_model, num_heads):
    super().__init__()
    self.num_heads = num_heads
    self.d_model = d_model
    # Linear layers project q, k, v (possibly of different depths) to d_model.
    self.wq = tf.keras.layers.Dense(d_model)
    self.wk = tf.keras.layers.Dense(d_model)
    self.wv = tf.keras.layers.Dense(d_model)

  def call(self, v, k, q, mask):
    q = self.wq(q)  # (batch_size, seq_len_q, d_model)
    k = self.wk(k)  # (batch_size, seq_len_k, d_model)
    v = self.wv(v)  # (batch_size, seq_len_k, d_model)
    # ... the tutorial then splits q, k, v into heads, applies
    # scaled dot-product attention, and recombines the heads.
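To make the point concrete, here is a minimal numpy sketch (not from the tutorial; all sizes and the random weights are made up for illustration) showing that the learned projections let queries and keys/values with different input depths be mapped to a shared d_model, after which scaled dot-product attention is well-defined:

```python
import numpy as np

rng = np.random.default_rng(0)

d_decoder, d_encoder, d_model = 6, 10, 8  # hypothetical, mismatched depths
len_q, len_kv = 4, 5

# Decoder-side queries and encoder-side keys/values with different depths.
q_in = rng.standard_normal((len_q, d_decoder))
kv_in = rng.standard_normal((len_kv, d_encoder))

# "New matrices": random stand-ins for the learned Dense projections.
Wq = rng.standard_normal((d_decoder, d_model))
Wk = rng.standard_normal((d_encoder, d_model))
Wv = rng.standard_normal((d_encoder, d_model))

q = q_in @ Wq   # (len_q, d_model)
k = kv_in @ Wk  # (len_kv, d_model)
v = kv_in @ Wv  # (len_kv, d_model)

# Scaled dot-product attention now works because q and k share a depth.
scores = q @ k.T / np.sqrt(d_model)  # (len_q, len_kv)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
out = weights @ v  # (len_q, d_model)
print(out.shape)  # (4, 8)
```

Without Wq and Wk, the product q_in @ kv_in.T would fail outright here, since the two sides have depths 6 and 10.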

I get it, thank you :smiley: