Based on the video with the “l’Afrique” example, I don’t understand why the different heads end up with different values after training. Won’t they converge to the same values even if the initialization is random?
I am trying to understand how the different “questions” (what’s happening, who, etc.) are imposed on the model, since the calculations are the same for every head.
Are the q’s different for each head?
I ask because in self-attention, the q belonged to the word for which the attention was being computed.
As a much belated reply:
You can look at the different heads as quantitative indicators of different perspectives on the meaning dependencies between words. The meaning of a word in a text may depend in various ways on other words in the text. These dependencies can arise from multiple meanings of words, multiple wider contexts of a text, the background understanding required to translate from an example to an output text during training, and so on.
A deeper understanding of this requires a discussion of the philosophy of language - in particular hermeneutics - but this lies outside the scope of the course.
As you can see in the video on self-attention and multi-head attention, the parameters determining the value of queries are calibrated during training and so these will differ between heads.
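To make this concrete, here is a small sketch in plain Python. It is only an illustration under assumptions of my own: the token vectors and the two `W_q` matrices are made-up numbers, not values from the course, and `W_k` is set to the identity in both heads so that any difference comes from the query projection alone. The point is that each head has its own `W_q`, so the same word produces a different q (and different attention weights) in each head.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(W_q, W_k, tokens, i):
    """Attention weights of token i over all tokens for one head."""
    q = matvec(W_q, tokens[i])               # this head's query for token i
    keys = [matvec(W_k, t) for t in tokens]  # this head's keys
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    return softmax(scores)

# Hypothetical 2-dimensional embeddings for three tokens.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Made-up per-head parameters; in a real model these are learned.
identity = [[1.0, 0.0], [0.0, 1.0]]
heads = {
    "head_1": {"W_q": [[0.5, -0.3], [0.1, 0.8]], "W_k": identity},
    "head_2": {"W_q": [[-0.7, 0.2], [0.4, -0.1]], "W_k": identity},
}

# Same input token (index 0), different W_q, hence different attention.
weights = {
    name: attention_weights(p["W_q"], p["W_k"], tokens, 0)
    for name, p in heads.items()
}
for name, w in weights.items():
    print(name, [round(v, 3) for v in w])
```

With these numbers, head_1 puts most of its weight on the third token and head_2 on the second, even though both heads see exactly the same input.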
Thank you for your answer. I have the same trouble understanding why the weight matrices differ between heads when they get the same q, k and v. Could I say that the heads differ because gradient descent converges to different weight matrices for each of them? If that is the case, is there any mechanism to prevent the heads from ending up with the same weight matrix?
This issue has been discussed here. This discussion may also clarify.
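As a rough illustration of the symmetry argument (a toy sketch, not the course's implementation): if two heads start from identical weights, they receive identical gradients and so stay identical, whereas different starting weights give different gradients, which lets the heads drift apart. The sketch below assumes a heavily simplified head (only `W_q` is trainable, keys and values are the raw token vectors, and the two head outputs are summed instead of concatenated and projected). The matrices `W_1` and `W_2` are made-up values standing in for two random draws, and the gradients are estimated numerically.

```python
import math

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def head_output(W_q, tokens, i):
    """Simplified head: keys and values are the raw token vectors."""
    q = matvec(W_q, tokens[i])
    w = softmax([sum(a * b for a, b in zip(q, t)) for t in tokens])
    return [sum(wj * t[c] for wj, t in zip(w, tokens))
            for c in range(len(tokens[0]))]

def loss(W_a, W_b, tokens, target):
    """Squared error of the summed head outputs for token 0."""
    out_a = head_output(W_a, tokens, 0)
    out_b = head_output(W_b, tokens, 0)
    return sum((a + b - t) ** 2 for a, b, t in zip(out_a, out_b, target))

def numeric_grads(W_a, W_b, tokens, target, eps=1e-5):
    """Central-difference gradients of the loss w.r.t. each head's W_q."""
    def grad_wrt(head_index):
        grad = [[0.0] * len(row) for row in W_a]
        for r in range(len(W_a)):
            for c in range(len(W_a[0])):
                values = []
                for sign in (1, -1):
                    mats = [[row[:] for row in W_a], [row[:] for row in W_b]]
                    mats[head_index][r][c] += sign * eps
                    values.append(loss(mats[0], mats[1], tokens, target))
                grad[r][c] = (values[0] - values[1]) / (2 * eps)
        return grad
    return grad_wrt(0), grad_wrt(1)

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical embeddings
target = [1.0, 0.0]                             # hypothetical target
W_1 = [[0.5, -0.3], [0.1, 0.8]]                 # made-up initial weights
W_2 = [[-0.7, 0.2], [0.4, -0.1]]

# Both heads start from W_1: their gradients coincide.
g_a, g_b = numeric_grads(W_1, W_1, tokens, target)
same_init_gap = max_abs_diff(g_a, g_b)

# Heads start from different weights: their gradients differ.
g_a, g_b = numeric_grads(W_1, W_2, tokens, target)
diff_init_gap = max_abs_diff(g_a, g_b)

print("gradient gap, identical init:", same_init_gap)
print("gradient gap, different init:", diff_init_gap)
```

So in this toy setting nothing explicitly forces the heads to differ; the different starting points are what break the symmetry, and gradient descent then updates each head along its own path.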
Thank you so much, it’s really helpful.