In 01_03 Multi-Head Attention:
Considering the word “l’Afrique”: for each head, indexed by the subscript i, we use a new set of learned weight matrices (W_i^Q, W_i^K, W_i^V) to compute query, key, and value vectors. Each head asks a different question about the input sequence.
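As a rough sketch of what “a new set of learned weight matrices per head” means in code (the dimensions d_model = 64 and h = 8 are illustrative assumptions, as are the random matrices standing in for learned weights):

```python
import numpy as np

d_model, h = 64, 8                  # embedding size and head count (assumptions)
d_k = d_model // h                  # per-head dimension, here 8

rng = np.random.default_rng(0)
# One (W_i^Q, W_i^K, W_i^V) triple per head, each projecting d_model -> d_k
W_Q = rng.standard_normal((h, d_k, d_model)) * 0.1
W_K = rng.standard_normal((h, d_k, d_model)) * 0.1
W_V = rng.standard_normal((h, d_k, d_model)) * 0.1

q3 = rng.standard_normal(d_model)   # stand-in for q^{<3>} ("l'Afrique")
q3_head1 = W_Q[0] @ q3              # q_1^{<3>} = W_1^Q . q^{<3>}, shape (8,)
```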
- First Head: Asks, “Who is involved?”
- Computes q_1^{<3>} = W_1^Q \cdot q^{<3>} for “l’Afrique”, and k_1^{<j>} = W_1^K \cdot k^{<j>}, v_1^{<j>} = W_1^V \cdot v^{<j>} for every word j. ( \small\text{represented by }{\color{black}\longrightarrow}\mathbf{\text{Attention}(W_1^Q Q, W_1^K K, W_1^V V)} \small\text{ in the image} )
- For example, the word “Jane” might have the highest relevance to the question about “l’Afrique”.
- Second Head: Asks, “What’s happening?”
- Computes \textcolor{blue}{q_2^{<3>} = W_2^Q \cdot q^{<3>}}, \textcolor{blue}{k_2^{<j>} = W_2^K \cdot k^{<j>}}, \textcolor{blue}{v_2^{<j>} = W_2^V \cdot v^{<j>}} for every word j. ( \small\text{represented by }{\color{blue}\longrightarrow}\mathbf{\text{Attention}(W_2^Q Q, W_2^K K, W_2^V V)} \small\text{ in the image} )
- For example, the word “visite” might have the highest relevance to the question about “l’Afrique”.
- Third Head: Asks, “When is something happening?”
- Computes \textcolor{red}{q_3^{<3>} = W_3^Q \cdot q^{<3>}}, \textcolor{red}{k_3^{<j>} = W_3^K \cdot k^{<j>}}, \textcolor{red}{v_3^{<j>} = W_3^V \cdot v^{<j>}} for every word j. ( \small\text{represented by }{\color{red}\longrightarrow}\mathbf{\text{Attention}(W_3^Q Q, W_3^K K, W_3^V V)} \small\text{ in the image} )
- For example, the word “septembre” might have the highest relevance to the question about “l’Afrique”.
- Fourth Head, and so on up to the i-th head (i = 8 being the typical choice)
- Computes q_i^{<3>} = W_i^Q \cdot q^{<3>}, k_i^{<j>} = W_i^K \cdot k^{<j>}, v_i^{<j>} = W_i^V \cdot v^{<j>} for every word j.
- Finds the word j that is most relevant to head i’s question about “l’Afrique” (see the numpy sketch after this list).
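Here is a minimal numpy sketch of the per-head computation described in the list above, ending with the concatenation of all heads (every dimension and random vector is an illustrative assumption; in a real model the W matrices are learned):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # raw attention scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V

# "Jane visite l'Afrique en septembre" -> 5 word vectors (random stand-ins)
T, d_model, h = 5, 64, 8
d_k = d_model // h
rng = np.random.default_rng(0)
X = rng.standard_normal((T, d_model))                 # here q = k = v = X

heads = []
for i in range(h):                                    # one "question" per head
    W_Qi = rng.standard_normal((d_model, d_k)) * 0.1  # W_i^Q
    W_Ki = rng.standard_normal((d_model, d_k)) * 0.1  # W_i^K
    W_Vi = rng.standard_normal((d_model, d_k)) * 0.1  # W_i^V
    heads.append(attention(X @ W_Qi, X @ W_Ki, X @ W_Vi))  # head_i

W_O = rng.standard_normal((h * d_k, d_model)) * 0.1
multi_head = np.concatenate(heads, axis=-1) @ W_O     # concat heads, then project
print(multi_head.shape)                               # (5, 64): one vector per word
```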
I am surprised by, and don’t fully understand, how this relationship between the one word “l’Afrique” and the most relevant word is established via the products of the queries (q_i^{<3>}) and keys (k_i^{<j>}), called attention scores, in each head (\text{head}_i). My understanding is that the learned weights W_i^Q, W_i^K, W_i^V first produce the attention scores, which are then scaled and passed through a softmax, and finally each softmax probability is multiplied by the corresponding value v_i^{<j>} and the results are summed.
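That reading is essentially right; the following sketch spells out the score → scale → softmax → weighted-values pipeline for the single word “l’Afrique” inside one head (the random vectors are stand-ins, an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
d_k, T = 8, 5                       # per-head size, sentence length (assumptions)
q3 = rng.standard_normal(d_k)       # q_i^{<3>} for "l'Afrique" in head i
K = rng.standard_normal((T, d_k))   # k_i^{<j>} for all five words
V = rng.standard_normal((T, d_k))   # v_i^{<j>} for all five words

scores = K @ q3 / np.sqrt(d_k)      # 1) scaled attention scores, one per word
p = np.exp(scores - scores.max())
p /= p.sum()                        # 2) softmax -> relevance probabilities
out3 = p @ V                        # 3) probability-weighted sum of the values
print(p.argmax())                   # index of the word deemed most relevant
```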
Are all these weights randomly initialized? And what kind of error does each of these query, key, and value projections try to reduce during forward propagation and backpropagation while moving toward this goal of establishing relevant contextual information w.r.t. all the words in the sentence?
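For what it’s worth, a minimal sketch of the usual setup behind this question: the projection matrices start out randomly initialized and carry no error term of their own; they are only updated by gradients of the downstream task loss (the quadratic loss below is a made-up stand-in for that loss, not anything from the lecture):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
W_Q = rng.standard_normal((d, d)) * 0.1   # randomly initialized, like W_i^Q
x = rng.standard_normal(d)
target = rng.standard_normal(d)

def task_loss(W):
    # Stand-in for "full forward pass + downstream loss" (an assumption);
    # a real Transformer would compute attention and, e.g., translation
    # cross-entropy here.
    return float(((W @ x - target) ** 2).sum())

# Numerical gradient of the task loss w.r.t. every entry of W_Q:
eps = 1e-6
grad = np.zeros_like(W_Q)
for idx in np.ndindex(*W_Q.shape):
    W_plus = W_Q.copy()
    W_plus[idx] += eps
    grad[idx] = (task_loss(W_plus) - task_loss(W_Q)) / eps

W_Q -= 0.05 * grad       # one gradient-descent step: the only "error" W_Q
print(task_loss(W_Q))    # reduces is the downstream task loss
```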