How are the weights of queries, keys, and values found to establish relevant words? Are they random?

In 01_03, multi-head attention.

Considering the word “l’Afrique”, for each head, represented by subscript i, we use a new set of learned weight matrices (W_i^Q, W_i^K, W_i^V) to compute query, key, and value vectors. Each head asks a different question about the input sequence.

  • First Head: Asks, “Who is involved?”

    • Computes q_1^{<1>} = W_1^Q \cdot q^{<1>}, K_1 = W_1^K \cdot k^{<1>}, V_1 = W_1^V \cdot v^{<1>} (\small\text{represented by } {\color{black}\longrightarrow} \mathbf{Attention}(W_i^Q Q, W_i^K K, W_i^V V) \small\text{ in the image}).
    • For example, the word “Jane” might have the highest relevance to the question about “l’Afrique”.
  • Second Head: Asks, “What’s happening?”

    • Computes \textcolor{blue}{q_2^{<2>} = W_2^Q \cdot q^{<2>}}, \textcolor{blue}{K_2 = W_2^K \cdot k^{<2>}}, \textcolor{blue}{V_2 = W_2^V \cdot v^{<2>}} (\small\text{represented by } {\color{blue}\longrightarrow} \mathbf{Attention}(W_i^Q Q, W_i^K K, W_i^V V) \small\text{ in the image}).
    • For example, the word “visite” might have the highest relevance to the question about “l’Afrique”.
  • Third Head: Asks, “When is something happening?”

    • Computes \textcolor{red}{q_3^{<5>} = W_3^Q \cdot q^{<5>}}, \textcolor{red}{K_3 = W_3^K \cdot k^{<5>}}, \textcolor{red}{V_3 = W_3^V \cdot v^{<5>}} (\small\text{represented by } {\color{red}\longrightarrow} \mathbf{Attention}(W_i^Q Q, W_i^K K, W_i^V V) \small\text{ in the image}).
    • For example, the word “septembre” might have the highest relevance to the question about “l’Afrique”.
  • Fourth Head, and so on up to the i-th head (8 heads being the conventional choice):

    • Computes q_i^{<j>} = W_i^Q \cdot q^{<j>}, K_i = W_i^K \cdot k^{<j>}, V_i = W_i^V \cdot v^{<j>}.
    • Finds the j-th relevant word that answers the question about “l’Afrique” (a minimal numeric sketch of this per-head computation follows the list).
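
To make the per-head bookkeeping concrete, here is a minimal NumPy sketch under assumed toy dimensions (5 words for “Jane visite l’Afrique en septembre”, d_model = 16, 8 heads of size 4); the weight matrices are plain random numbers here, which is also exactly how they look before training:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

np.random.seed(0)
seq_len, d_model, n_heads, d_head = 5, 16, 8, 4   # toy sizes, not the course's real ones
X = np.random.randn(seq_len, d_model)             # stand-ins for x^<1> ... x^<5>

heads = []
for i in range(n_heads):
    # Each head i has its own W_i^Q, W_i^K, W_i^V (randomly initialized here)
    W_Q = np.random.randn(d_model, d_head) * 0.1
    W_K = np.random.randn(d_model, d_head) * 0.1
    W_V = np.random.randn(d_model, d_head) * 0.1

    Q, K, V = X @ W_Q, X @ W_K, X @ W_V           # q_i^<t>, k_i^<t>, v_i^<t> for every word t
    scores = Q @ K.T / np.sqrt(d_head)            # this head's attention scores
    A = softmax(scores)                           # each row sums to 1
    heads.append(A @ V)                           # weighted sum of value vectors

# Concatenate the heads and apply the output projection W^O
W_O = np.random.randn(n_heads * d_head, d_model) * 0.1
multi_head_output = np.concatenate(heads, axis=1) @ W_O
print(multi_head_output.shape)                    # (5, 16): one context vector per word
```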

I am surprised by / don’t understand how this relationship between the single word “l’Afrique” and the most relevant word is established from the product of queries (q_i^{<3>}) and keys (k_i^{<3>}), called the attention score, in each head ({head}_i). My understanding is that the learned weights W_i^Q, W_i^K, W_i^V first produce this attention score, which is then normalized and passed through a softmax, and finally each softmax probability is multiplied by the corresponding value vector v_i^{<3>}.

Are all these weights randomly selected? In other words, what kind of error do these queries, keys, and values try to reduce during the forward pass and backpropagation while moving toward this goal of establishing relevant contextual information with respect to all the words in the sentence?

Hi @munish259272,

In multi-head attention, different heads learn to capture different types of relationships between words in a sentence. Each attention head has its own set of learned weight matrices (W_i^Q, W_i^K, W_i^V), which convert the input embeddings into query, key, and value vectors. The dot-product attention mechanism then calculates relevance scores between the query and all keys, followed by a softmax function to determine attention weights. The final output of each head is a weighted sum of the value vectors (more relevant words receive higher weights).
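
As a rough sketch of that flow for a single query, here is what one head does with the query belonging to “l’Afrique” (all vectors are random placeholders and d_head = 4 is an arbitrary choice, so the printed weights are meaningless until training has shaped the projections):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

np.random.seed(1)
d_head = 4
words = ["Jane", "visite", "l'Afrique", "en", "septembre"]
K = np.random.randn(5, d_head)            # this head's keys k^<1> ... k^<5>
V = np.random.randn(5, d_head)            # this head's values v^<1> ... v^<5>
q_afrique = np.random.randn(d_head)       # this head's query for "l'Afrique" (q^<3>)

scores = K @ q_afrique / np.sqrt(d_head)  # one relevance score per word
weights = softmax(scores)                 # attention weights, sum to 1
context = weights @ V                     # weighted sum of the value vectors

for w, a in zip(words, weights):
    print(f"{w:>10}: {a:.2f}")            # the word with the largest weight "answers" this head's question
```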

Each head focuses on a different linguistic aspect. For example, one head may identify who is involved, another may focus on what is happening, and another may determine when something happens. This allows the model to process sentence structures more effectively.

The weights in multi-head attention are randomly initialized at the start of training. But, through backpropagation and gradient descent, the model learns to adjust these weights to improve its performance. Over time, the learned weight matrices develop meaningful representations and help the model generalize across different sentences.
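
Here is a minimal PyTorch sketch of that point, with made-up sizes and a stand-in MSE loss in place of the real task loss (in the translation example the loss would be the decoder's cross-entropy): the projections start as random numbers, and a single backpropagation step already nudges them.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_head = 16, 4                           # toy sizes

# W^Q, W^K, W^V for one head start out randomly initialized (nn.Linear's default)
W_Q = nn.Linear(d_model, d_head, bias=False)
W_K = nn.Linear(d_model, d_head, bias=False)
W_V = nn.Linear(d_model, d_head, bias=False)
params = list(W_Q.parameters()) + list(W_K.parameters()) + list(W_V.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)

x = torch.randn(5, d_model)                       # 5 toy word embeddings
target = torch.randn(5, d_head)                   # stand-in training signal

Q, K, V = W_Q(x), W_K(x), W_V(x)
attn = torch.softmax(Q @ K.T / d_head ** 0.5, dim=-1)
out = attn @ V

loss = nn.functional.mse_loss(out, target)        # placeholder for the real task loss
loss.backward()                                   # gradients flow back into W^Q, W^K, W^V
optimizer.step()                                  # the random weights move toward useful ones
```

In a real Transformer the same gradient signal, coming from the downstream task, is what gradually shapes every head's W_i^Q, W_i^K, W_i^V into the "who / what / when" behaviour described above.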

Hope this helps! Let me know if you need further assistance.
