Sorry if this seems a bit basic, but we are supposed to use MultiHeadAttention when building the encoder, right? For the Encoder layer, though, it seems we are required to use self-attention, like so:
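(Roughly something like the sketch below. This is my own paraphrase using `tf.keras.layers.MultiHeadAttention`; the variable names and sizes are made up, not taken from the assignment.)

```python
import tensorflow as tf

# Hypothetical sketch of the encoder-layer call I mean.
# The key point: query, key and value are all the same tensor x
# (the encoder input), i.e. self-attention.
mha = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=64)

x = tf.random.uniform((2, 10, 512))         # (batch, seq_len, d_model)
attn_output = mha(query=x, value=x, key=x)  # q = k = v = x  ->  self-attention
print(attn_output.shape)                    # (2, 10, 512)
```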
So if q, k, and v are the same, we are just performing self-attention (but with multiple heads, I guess?). Why, then, do the diagrams claim the encoder uses multi-head attention?
If we are using self-attention only, aren't we missing the extra Wq, Wk, and Wv matrices?
Aren’t these important for the multi-head attention mechanism?
Basically, you're describing the same thing. As Andrew mentioned in the multi-head attention lecture, multi-head attention is built on top of the self-attention concept. Whether the data attends to itself (as in the encoder) or to other data (as in the second attention block of the decoder), MultiHeadAttention is suitable for both.
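To make that concrete, here is a small sketch (not the assignment's code; the layer names and tensor shapes are my own) showing that the very same layer class handles both cases, only the inputs differ:

```python
import tensorflow as tf

# Two separate attention layers, as in a real encoder/decoder stack.
self_mha  = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=64)
cross_mha = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=64)

enc_out = tf.random.uniform((2, 12, 512))  # encoder output: (batch, enc_len, d_model)
dec_in  = tf.random.uniform((2, 7, 512))   # decoder states: (batch, dec_len, d_model)

# Self-attention: the data attends to itself (encoder, or the first decoder block).
self_attn  = self_mha(query=dec_in, value=dec_in, key=dec_in)

# Cross-attention: queries from the decoder, keys/values from the encoder output.
cross_attn = cross_mha(query=dec_in, value=enc_out, key=enc_out)

print(self_attn.shape, cross_attn.shape)   # (2, 7, 512) (2, 7, 512)
```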
Perhaps it helps to think about what the linear transformations (Wq, Wk, Wv) are for. Without them, we couldn't apply scaled dot-product attention when the inputs (q, k, v) don't have the same dimensions (as when the decoder attends to the encoder output). Because the linear transformations project everything to the same dimension (the depth axis), the layer can attend both to itself and to other data. Also note that these matrices live inside the MultiHeadAttention layer, so they are not missing when q = k = v; the layer still learns separate projections for queries, keys, and values.
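Here is a minimal NumPy sketch of a single head, with made-up dimensions, just to show why the projections matter: the query side and the key/value side can arrive with different widths, and Wq, Wk, Wv map them to a common depth so that the dot product is even defined.

```python
import numpy as np

d_model_q, d_model_kv, depth = 512, 256, 64   # illustrative sizes, not from the course
rng = np.random.default_rng(0)

# The learned linear transformations inside one attention head.
Wq = rng.normal(size=(d_model_q,  depth))
Wk = rng.normal(size=(d_model_kv, depth))
Wv = rng.normal(size=(d_model_kv, depth))

q_in  = rng.normal(size=(7,  d_model_q))    # e.g. decoder states (length 7)
kv_in = rng.normal(size=(12, d_model_kv))   # e.g. encoder output (length 12)

# After projection, Q and K share the same depth, so Q @ K.T is well defined.
Q, K, V = q_in @ Wq, kv_in @ Wk, kv_in @ Wv
scores  = Q @ K.T / np.sqrt(depth)                        # (7, 12)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)            # softmax over keys
attended = weights @ V                                    # (7, depth)
print(attended.shape)
```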
Furthermore, why a linear transformation? Why not a nonlinear one? My guess (just a guess) is that the authors tried it and it either didn't work or didn't improve results enough to justify the extra computational cost.