Multi-head attention different weight matrices

gkouro · August 23, 2022, 11:04am

Based on the video with the “l’Afrique” example, I don’t understand why the different heads end up with different values after training. Won’t they converge to the same values even if the initialization is random?

I am trying to understand how the different “questions” (what’s happening, who, etc.) are imposed on the model, since the calculations are the same for every head.

Are the q’s different for each head?

Because in self-attention, the q belonged to the term for which the Attention was being built.

reinoudbosch · October 24, 2022, 9:15pm

Hi gkouro,

As a much belated reply:

You can look at the different heads as quantitative indicators of different perspectives on meaning dependencies between words. The meaning of words in a text may depend in various ways on other words in the text. This can depend on multiple meanings of words, multiple wider contexts of a text, background understandings required to translate from an example to an output text during training, and so on.
A deeper understanding of this requires a discussion of the philosophy of language, in particular hermeneutics, but this lies outside the scope of the course.
As you can see in the video on self-attention and multi-head attention, the parameters that determine the queries (the weight matrices of each head) are learned during training, so they will differ between heads.
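
To make this concrete, here is a minimal NumPy sketch, not code from the course. It assumes each head has its own query, key and value projection matrices; the names W_q, W_k, W_v and head_output, and the sizes, are made up for illustration. The input x is shared, but each head projects it with different weights, so the q's are different for each head.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (assumptions, not values from the course).
seq_len, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads

# One shared input: each row of x is the embedding of one word.
x = rng.normal(size=(seq_len, d_model))

# Every head owns its own projection matrices, here randomly initialized.
W_q = rng.normal(size=(n_heads, d_model, d_head))
W_k = rng.normal(size=(n_heads, d_model, d_head))
W_v = rng.normal(size=(n_heads, d_model, d_head))


def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def head_output(h):
    # Same input x, but head-specific projections.
    q = x @ W_q[h]
    k = x @ W_k[h]
    v = x @ W_v[h]
    weights = softmax(q @ k.T / np.sqrt(d_head))  # scaled dot-product attention
    return q, weights, weights @ v


q0, attn0, out0 = head_output(0)
q1, attn1, out1 = head_output(1)

# The queries differ because W_q[0] and W_q[1] differ.
print(np.allclose(q0, q1))
```

The calculation in head_output is identical for every head; only the weights it uses change. During training those weights are updated by gradient descent, so which “question” a head ends up asking is learned rather than imposed.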

liuyf · October 25, 2022, 8:27pm

Hi reinoudbosch,

Thank you for your answer. I have the same trouble understanding why the weight matrices differ between heads when they are given the same q, k and v. Could I say that the heads differ because gradient descent converges to a different weight matrix for each head? If that is the case, is there any mechanism that prevents the heads from ending up with the same weight matrix?

reinoudbosch · October 25, 2022, 11:50pm

Hi liuyf,

This issue has been discussed here. This discussion may also clarify.
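
As a rough illustration of the role random initialization plays here, below is a toy NumPy sketch. It is not taken from the linked threads or from the course. It makes some simplifying assumptions: two heads, head outputs summed instead of concatenated and passed through an output matrix, a mean squared error loss, and gradients estimated by finite differences. The helpers loss and numerical_grad are hypothetical names.

```python
import numpy as np


def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def loss(W, x, target):
    # W[h, 0], W[h, 1], W[h, 2] are the query, key, value matrices of head h.
    # Simplification: head outputs are summed rather than concatenated.
    total = np.zeros_like(target)
    for Wq, Wk, Wv in W:
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        total += softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v
    return np.mean((total - target) ** 2)


def numerical_grad(W, x, target, eps=1e-6):
    # Central finite differences, one parameter at a time.
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        step = np.zeros_like(W)
        step[idx] = eps
        grad[idx] = (loss(W + step, x, target) - loss(W - step, x, target)) / (2 * eps)
    return grad


rng = np.random.default_rng(0)
seq_len, d_model, d_head = 3, 4, 2
x = rng.normal(size=(seq_len, d_model))
target = rng.normal(size=(seq_len, d_head))

# Case 1: both heads start from exactly the same weights.
one_head = rng.normal(size=(3, d_model, d_head))
W_same = np.stack([one_head, one_head])
g_same = numerical_grad(W_same, x, target)

# Case 2: each head gets its own random initialization.
W_rand = rng.normal(size=(2, 3, d_model, d_head))
g_rand = numerical_grad(W_rand, x, target)

same_init_grads_match = np.allclose(g_same[0], g_same[1], atol=1e-6)
random_init_grads_match = np.allclose(g_rand[0], g_rand[1], atol=1e-6)
print(same_init_grads_match, random_init_grads_match)
```

In this toy setup, heads that start identical receive identical gradients, so a gradient step keeps them identical. Heads that start from different random weights receive different gradients and can move in different directions. The sketch only shows that random initialization breaks the symmetry between heads; it does not show that trained heads are guaranteed to end up different.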

liuyf · November 1, 2022, 1:51pm

Hi reinoudbosch,

Thank you so much, it’s really helpful.
