Sorry if this seems a bit basic, but we are supposed to use MultiHeadAttention when building the encoder, right? For the Encoder layer, though, it seems we are required to use self-attention, like so:
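(Roughly something like the sketch below. This is my own paraphrase using `tf.keras.layers.MultiHeadAttention`; the variable names and sizes are made up, not taken from the assignment.)

```python
import tensorflow as tf

# Hypothetical sketch of the encoder-layer call I mean.
# The key point: query, key and value are all the same tensor x
# (the encoder input), i.e. self-attention.
mha = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=64)

x = tf.random.uniform((2, 10, 512))         # (batch, seq_len, d_model)
attn_output = mha(query=x, value=x, key=x)  # q = k = v = x  ->  self-attention
print(attn_output.shape)                    # (2, 10, 512)
```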
So if q, k, and v are the same, we are just performing self-attention (but with multiple heads, I guess?). Why, then, do the diagrams claim the encoder uses multi-head attention?
If we are using self-attention only, aren't we missing the extra Wq, Wk, and Wv matrices?
Aren’t these important for the multi-head attention mechanism?
Basically, you're describing the same thing. As Andrew mentioned in the multi-head attention lecture, multi-head attention is built on top of the self-attention concept. Whether the data attends to itself (as in the encoder) or to other data (as in the second attention block of the decoder), MultiHeadAttention is suitable for both.
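To make that concrete, here is a small sketch (not the assignment's code; the layer names and tensor shapes are my own) showing that the very same layer class handles both cases, only the inputs differ:

```python
import tensorflow as tf

# Two separate attention layers, as in a real encoder/decoder stack.
self_mha  = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=64)
cross_mha = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=64)

enc_out = tf.random.uniform((2, 12, 512))  # encoder output: (batch, enc_len, d_model)
dec_in  = tf.random.uniform((2, 7, 512))   # decoder states: (batch, dec_len, d_model)

# Self-attention: the data attends to itself (encoder, or the first decoder block).
self_attn  = self_mha(query=dec_in, value=dec_in, key=dec_in)

# Cross-attention: queries from the decoder, keys/values from the encoder output.
cross_attn = cross_mha(query=dec_in, value=enc_out, key=enc_out)

print(self_attn.shape, cross_attn.shape)   # (2, 7, 512) (2, 7, 512)
```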
Perhaps it helps to think about what the linear transformations (Wq, Wk, Wv) are for. Without them, we couldn't apply scaled dot-product attention when the inputs (q, k, v) don't have the same dimensions (as when the decoder attends to the encoder output). Because the linear transformations project everything to the same dimension (the depth axis), the layer can attend both to itself and to other data. Also note that these matrices live inside the MultiHeadAttention layer, so they are not missing when q = k = v; the layer still learns separate projections for queries, keys, and values.
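Here is a minimal NumPy sketch of a single head, with made-up dimensions, just to show why the projections matter: the query side and the key/value side can arrive with different widths, and Wq, Wk, Wv map them to a common depth so that the dot product is even defined.

```python
import numpy as np

d_model_q, d_model_kv, depth = 512, 256, 64   # illustrative sizes, not from the course
rng = np.random.default_rng(0)

# The learned linear transformations inside one attention head.
Wq = rng.normal(size=(d_model_q,  depth))
Wk = rng.normal(size=(d_model_kv, depth))
Wv = rng.normal(size=(d_model_kv, depth))

q_in  = rng.normal(size=(7,  d_model_q))    # e.g. decoder states (length 7)
kv_in = rng.normal(size=(12, d_model_kv))   # e.g. encoder output (length 12)

# After projection, Q and K share the same depth, so Q @ K.T is well defined.
Q, K, V = q_in @ Wq, kv_in @ Wk, kv_in @ Wv
scores  = Q @ K.T / np.sqrt(depth)                        # (7, 12)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)            # softmax over keys
attended = weights @ V                                    # (7, depth)
print(attended.shape)
```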
Furthermore, why a linear transformation? Why not a nonlinear one? My guess (just a guess) is that the authors tried it and it either didn't work or didn't improve results enough to justify the extra computational cost.