Week 4, Multi-Head Attention
I am still a bit confused about why we use W_i^Q * q, W_i^K * k, W_i^V * v instead of W_i^Q * x, W_i^K * x, W_i^V * x. Could we have written it that way? If so, how does the algorithm learn different embeddings for each head? Thank you in advance!
Attention is calculated differently in the encoder and the decoder: in self-attention q, k, and v are all the same input x, but in the decoder's cross-attention the queries come from the decoder while the keys and values come from the encoder output. Writing W_i^Q * q, W_i^K * k, W_i^V * v keeps the notation general, so it's good to follow the conventions used by the original authors.
Please remember that the initialization is different for each weight matrix, so the gradient updates will also be different and each head ends up learning its own projection. Your argument is the same as expecting all units of a dense layer to learn identical weights.
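To make that concrete, here is a minimal NumPy sketch (not the course's implementation; the function names and dimensions are made up for illustration) of multi-head self-attention where every head receives the same x, but each head's independently initialized W_i^Q, W_i^K, W_i^V already produces different outputs:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, num_heads=2, d_k=4, seed=0):
    """Toy multi-head self-attention with q = k = v = x for every head.

    Hypothetical sketch: x has shape (seq_len, d_model); each head owns its
    own randomly initialized projection matrices.
    """
    rng = np.random.default_rng(seed)
    seq_len, d_model = x.shape
    head_outputs = []
    for i in range(num_heads):
        # Each head draws its own W_i^Q, W_i^K, W_i^V. Even though every head
        # sees the identical input x, the differing initializations (and later,
        # differing gradient updates) make the heads diverge.
        W_q = rng.normal(scale=d_model ** -0.5, size=(d_model, d_k))
        W_k = rng.normal(scale=d_model ** -0.5, size=(d_model, d_k))
        W_v = rng.normal(scale=d_model ** -0.5, size=(d_model, d_k))
        Q, K, V = x @ W_q, x @ W_k, x @ W_v          # W_i^Q x, W_i^K x, W_i^V x
        scores = Q @ K.T / np.sqrt(d_k)              # scaled dot-product attention
        head_outputs.append(softmax(scores) @ V)
    # Concatenate the heads; a final output projection W^O would normally follow.
    return np.concatenate(head_outputs, axis=-1)

x = np.ones((3, 8))                    # same input fed to every head
out = multi_head_self_attention(x)
print(out.shape)                        # (3, 8): heads differ despite identical x
```

Running this with different seeds gives different per-head outputs, which is the point of the reply above: the separation between heads comes from the separately initialized and separately updated weight matrices, not from feeding each head a different input.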