In the self-attention lecture we calculate q, k and v to get A, which I understand. But when talking about multi-head attention, Ng multiplies Q, K and V by three new matrices before calculating attention. Is that a mistake?

If it is intentional, then why?

I read the Transformer article on jagblog, which also calculates attention directly from Q, K and V, without multiplying them by a new matrix again.

Please forgive me for my poor English =, =

It's because MultiHeadAttention is more versatile than plain self-attention. In self-attention, Q, K and V come from the same input and have the same dimensions, but in MultiHeadAttention they may not. For example, the 2nd MultiHeadAttention block in the Transformer decoder takes Q from the first MultiHeadAttention block and K, V from the encoder output. The decoder and encoder tensors can differ in shape, so a linear transformation is applied to each of them to give them the same depth (d_model) for the calculation that follows. It looks like the code below (copied from the tutorial).

```
import tensorflow as tf

class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads):
        """......"""
        # One learned linear projection each for queries, keys and values.
        self.wq = tf.keras.layers.Dense(d_model)
        self.wk = tf.keras.layers.Dense(d_model)
        self.wv = tf.keras.layers.Dense(d_model)
        """......"""

    def call(self, v, k, q, mask):
        """......"""
        # Project each input to depth d_model before attention is computed.
        q = self.wq(q)  # (batch_size, seq_len, d_model)
        k = self.wk(k)  # (batch_size, seq_len, d_model)
        v = self.wv(v)  # (batch_size, seq_len, d_model)
        """......"""
```
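
To see why the projections matter, here is a minimal NumPy sketch of a single attention head. It is not the tutorial's code: the function name `project_and_attend`, the random weights, and the input depths 48 and 64 are all assumptions chosen only to show the shapes. (In the standard Transformer the encoder and decoder already share `d_model`; the mismatch here is hypothetical.)

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def project_and_attend(q_in, kv_in, d_model):
    """Single-head cross-attention with random (untrained) projection weights."""
    # Hypothetical stand-ins for the learned wq, wk, wv Dense layers.
    w_q = rng.normal(size=(q_in.shape[-1], d_model))
    w_k = rng.normal(size=(kv_in.shape[-1], d_model))
    w_v = rng.normal(size=(kv_in.shape[-1], d_model))

    q = q_in @ w_q    # (batch, seq_len_q,  d_model)
    k = kv_in @ w_k   # (batch, seq_len_kv, d_model)
    v = kv_in @ w_v   # (batch, seq_len_kv, d_model)

    # Scaled dot-product attention; q and k now share the same depth.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_model)
    weights = softmax(scores, axis=-1)  # (batch, seq_len_q, seq_len_kv)
    return weights @ v, weights

# Assumed shapes: decoder states with depth 48, encoder output with depth 64.
decoder_states = rng.normal(size=(2, 5, 48))
encoder_output = rng.normal(size=(2, 7, 64))

out, weights = project_and_attend(decoder_states, encoder_output, d_model=32)
print(out.shape)      # (2, 5, 32)
print(weights.shape)  # (2, 5, 7)
```

Without the projections, the dot product between a depth-48 query and a depth-64 key would not even be defined. With them, every query position in the decoder gets a weighting over all encoder positions.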