Week 4, Multi-Head Attention
I am still a bit confused about why we use W_i^Q * q, W_i^K * k, W_i^V * v instead of W_i^Q * x, W_i^K * x, W_i^V * x. Could we have written it that way? If so, how does the algorithm learn different embeddings for each head? Thank you in advance!
Attention is calculated differently in the encoder and the decoder: in self-attention q, k, and v are all the same input x, but in the decoder's cross-attention the queries come from the decoder while the keys and values come from the encoder output. Writing W_i^Q * q, W_i^K * k, W_i^V * v keeps the notation general, so it's good to follow the conventions used by the original authors.
Please remember that the initialization is different for each weight matrix, so the gradient updates will also be different and each head ends up learning its own projection. Your argument is the same as expecting all units of a dense layer to learn identical weights.
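To make that concrete, here is a minimal NumPy sketch (not the course's implementation; the function names and dimensions are made up for illustration) of multi-head self-attention where every head receives the same x, but each head's independently initialized W_i^Q, W_i^K, W_i^V already produces different outputs:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, num_heads=2, d_k=4, seed=0):
    """Toy multi-head self-attention with q = k = v = x for every head.

    Hypothetical sketch: x has shape (seq_len, d_model); each head owns its
    own randomly initialized projection matrices.
    """
    rng = np.random.default_rng(seed)
    seq_len, d_model = x.shape
    head_outputs = []
    for i in range(num_heads):
        # Each head draws its own W_i^Q, W_i^K, W_i^V. Even though every head
        # sees the identical input x, the differing initializations (and later,
        # differing gradient updates) make the heads diverge.
        W_q = rng.normal(scale=d_model ** -0.5, size=(d_model, d_k))
        W_k = rng.normal(scale=d_model ** -0.5, size=(d_model, d_k))
        W_v = rng.normal(scale=d_model ** -0.5, size=(d_model, d_k))
        Q, K, V = x @ W_q, x @ W_k, x @ W_v          # W_i^Q x, W_i^K x, W_i^V x
        scores = Q @ K.T / np.sqrt(d_k)              # scaled dot-product attention
        head_outputs.append(softmax(scores) @ V)
    # Concatenate the heads; a final output projection W^O would normally follow.
    return np.concatenate(head_outputs, axis=-1)

x = np.ones((3, 8))                    # same input fed to every head
out = multi_head_self_attention(x)
print(out.shape)                        # (3, 8): heads differ despite identical x
```

Running this with different seeds gives different per-head outputs, which is the point of the reply above: the separation between heads comes from the separately initialized and separately updated weight matrices, not from feeding each head a different input.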