In 01_03 Multi-Head Attention:
Considering the word “l’Afrique”: for each head, indexed by the subscript i, we use a new set of learned weight matrices (W_i^Q, W_i^K, W_i^V) to compute query, key, and value vectors. Each head asks a different question about the input sequence.
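As a rough sketch of what “a new set of learned weight matrices per head” means in code (the dimensions d_model = 64 and h = 8 are illustrative assumptions, as are the random matrices standing in for learned weights):

```python
import numpy as np

d_model, h = 64, 8                  # embedding size and head count (assumptions)
d_k = d_model // h                  # per-head dimension, here 8

rng = np.random.default_rng(0)
# One (W_i^Q, W_i^K, W_i^V) triple per head, each projecting d_model -> d_k
W_Q = rng.standard_normal((h, d_k, d_model)) * 0.1
W_K = rng.standard_normal((h, d_k, d_model)) * 0.1
W_V = rng.standard_normal((h, d_k, d_model)) * 0.1

q3 = rng.standard_normal(d_model)   # stand-in for q^{<3>} ("l'Afrique")
q3_head1 = W_Q[0] @ q3              # q_1^{<3>} = W_1^Q . q^{<3>}, shape (8,)
```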
- First Head: Asks, “Who is involved?”
- Computes q_1^{<3>} = W_1^Q \cdot q^{<3>} for “l’Afrique”, and k_1^{<j>} = W_1^K \cdot k^{<j>}, v_1^{<j>} = W_1^V \cdot v^{<j>} for every word j. ( \small\text{represented by }{\color{black}\longrightarrow}\mathbf{\text{Attention}(W_1^Q Q, W_1^K K, W_1^V V)} \small\text{ in the image} )
- For example, the word “Jane” might have the highest relevance to the question about “l’Afrique”.
- Second Head: Asks, “What’s happening?”
- Computes \textcolor{blue}{q_2^{<3>} = W_2^Q \cdot q^{<3>}}, \textcolor{blue}{k_2^{<j>} = W_2^K \cdot k^{<j>}}, \textcolor{blue}{v_2^{<j>} = W_2^V \cdot v^{<j>}} for every word j. ( \small\text{represented by }{\color{blue}\longrightarrow}\mathbf{\text{Attention}(W_2^Q Q, W_2^K K, W_2^V V)} \small\text{ in the image} )
- For example, the word “visite” might have the highest relevance to the question about “l’Afrique”.
- Third Head: Asks, “When is something happening?”
- Computes \textcolor{red}{q_3^{<3>} = W_3^Q \cdot q^{<3>}}, \textcolor{red}{k_3^{<j>} = W_3^K \cdot k^{<j>}}, \textcolor{red}{v_3^{<j>} = W_3^V \cdot v^{<j>}} for every word j. ( \small\text{represented by }{\color{red}\longrightarrow}\mathbf{\text{Attention}(W_3^Q Q, W_3^K K, W_3^V V)} \small\text{ in the image} )
- For example, the word “septembre” might have the highest relevance to the question about “l’Afrique”.
- Fourth Head, and so on up to the i-th head (i = 8 being the typical choice)
- Computes q_i^{<3>} = W_i^Q \cdot q^{<3>}, k_i^{<j>} = W_i^K \cdot k^{<j>}, v_i^{<j>} = W_i^V \cdot v^{<j>} for every word j.
- Finds the word j that is most relevant to head i’s question about “l’Afrique” (see the numpy sketch after this list).
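Here is a minimal numpy sketch of the per-head computation described in the list above, ending with the concatenation of all heads (every dimension and random vector is an illustrative assumption; in a real model the W matrices are learned):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # raw attention scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V

# "Jane visite l'Afrique en septembre" -> 5 word vectors (random stand-ins)
T, d_model, h = 5, 64, 8
d_k = d_model // h
rng = np.random.default_rng(0)
X = rng.standard_normal((T, d_model))                 # here q = k = v = X

heads = []
for i in range(h):                                    # one "question" per head
    W_Qi = rng.standard_normal((d_model, d_k)) * 0.1  # W_i^Q
    W_Ki = rng.standard_normal((d_model, d_k)) * 0.1  # W_i^K
    W_Vi = rng.standard_normal((d_model, d_k)) * 0.1  # W_i^V
    heads.append(attention(X @ W_Qi, X @ W_Ki, X @ W_Vi))  # head_i

W_O = rng.standard_normal((h * d_k, d_model)) * 0.1
multi_head = np.concatenate(heads, axis=-1) @ W_O     # concat heads, then project
print(multi_head.shape)                               # (5, 64): one vector per word
```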
I am surprised by, and don’t fully understand, how this relationship between the one word “l’Afrique” and the most relevant word is established via the products of the queries (q_i^{<3>}) and keys (k_i^{<j>}), called attention scores, in each head (\text{head}_i). My understanding is that the learned weights W_i^Q, W_i^K, W_i^V first produce the attention scores, which are then scaled and passed through a softmax, and finally each softmax probability is multiplied by the corresponding value v_i^{<j>} and the results are summed.
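That reading is essentially right; the following sketch spells out the score → scale → softmax → weighted-values pipeline for the single word “l’Afrique” inside one head (the random vectors are stand-ins, an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
d_k, T = 8, 5                       # per-head size, sentence length (assumptions)
q3 = rng.standard_normal(d_k)       # q_i^{<3>} for "l'Afrique" in head i
K = rng.standard_normal((T, d_k))   # k_i^{<j>} for all five words
V = rng.standard_normal((T, d_k))   # v_i^{<j>} for all five words

scores = K @ q3 / np.sqrt(d_k)      # 1) scaled attention scores, one per word
p = np.exp(scores - scores.max())
p /= p.sum()                        # 2) softmax -> relevance probabilities
out3 = p @ V                        # 3) probability-weighted sum of the values
print(p.argmax())                   # index of the word deemed most relevant
```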
Are all these weights randomly initialized? And what kind of error does each of these query, key, and value projections try to reduce during forward propagation and backpropagation while moving toward this goal of establishing relevant contextual information w.r.t. all the words in the sentence?
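For what it’s worth, a minimal sketch of the usual setup behind this question: the projection matrices start out randomly initialized and carry no error term of their own; they are only updated by gradients of the downstream task loss (the quadratic loss below is a made-up stand-in for that loss, not anything from the lecture):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
W_Q = rng.standard_normal((d, d)) * 0.1   # randomly initialized, like W_i^Q
x = rng.standard_normal(d)
target = rng.standard_normal(d)

def task_loss(W):
    # Stand-in for "full forward pass + downstream loss" (an assumption);
    # a real Transformer would compute attention and, e.g., translation
    # cross-entropy here.
    return float(((W @ x - target) ** 2).sum())

# Numerical gradient of the task loss w.r.t. every entry of W_Q:
eps = 1e-6
grad = np.zeros_like(W_Q)
for idx in np.ndindex(*W_Q.shape):
    W_plus = W_Q.copy()
    W_plus[idx] += eps
    grad[idx] = (task_loss(W_plus) - task_loss(W_Q)) / eps

W_Q -= 0.05 * grad       # one gradient-descent step: the only "error" W_Q
print(task_loss(W_Q))    # reduces is the downstream task loss
```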