Questions about Transformer W_Q, W_K and W_V

haoyundeng · February 9, 2022, 4:09am

Hi all, I have 3 questions about how we calculate/train the parameter matrices W_Q, W_K and W_V.

1. In a machine translation model, during training, are all the parameters (W_Q, W_K, W_V, the other dense-layer parameters, and the softmax-layer parameters) trained and updated at the SAME time, once we have finished translating the input and computed the cross-entropy loss? Or do we somehow pre-train W_Q, W_K and W_V (how?), use them to get Q, K and V, and then train the other network parameters?
2. In the multi-head case, how do we get different sets of W matrices, and therefore different Q, K and V? I mean, if we have found a set of Q, K, V that best represents the relations among the input words, why do we need several different ones, and how do we train the model to make sure the sets come out different?
3. If we combine the heads of multi-head attention by concatenation, how is that different from simply increasing the dimension of the attention vector?
Thanks for any help!

reinoudbosch · May 10, 2022, 5:11pm

Hi haoyundeng,

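With regard to your first question: yes, they are updated at the same time. W_Q, W_K and W_V are ordinary trainable weights, so they are updated by backpropagation from the same loss as every other layer; there is no separate pre-training step for them. Remember that the matrices are initialized randomly.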
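To make the "updated at the same time" point concrete, here is a minimal sketch of one training step for a single attention head. It rests on several assumptions that are not in the posts above: the sizes, inputs and targets are made up, a squared-error loss stands in for the cross-entropy loss, and finite-difference gradients stand in for backpropagation so that the code needs nothing beyond the standard library. The helper names (`attention`, `train_step`) are hypothetical.

```python
import math
import random

def matmul(a, b):
    # Plain-Python matrix product of two lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(x, w_q, w_k, w_v):
    # Scaled dot-product attention for one head.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(k[0]))
    scores = matmul(q, [list(col) for col in zip(*k)])  # Q K^T
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)

def loss(params, x, target):
    # Assumption: squared error instead of cross-entropy, to keep the sketch short.
    out = attention(x, params["W_Q"], params["W_K"], params["W_V"])
    return sum((o - t) ** 2 for ro, rt in zip(out, target) for o, t in zip(ro, rt))

def train_step(params, x, target, lr=0.05, eps=1e-5):
    # Gradients of the ONE loss with respect to EVERY matrix (central differences).
    grads = {}
    for name, w in params.items():
        grads[name] = [[0.0] * len(w[0]) for _ in w]
        for i in range(len(w)):
            for j in range(len(w[0])):
                old = w[i][j]
                w[i][j] = old + eps
                up = loss(params, x, target)
                w[i][j] = old - eps
                down = loss(params, x, target)
                w[i][j] = old
                grads[name][i][j] = (up - down) / (2 * eps)
    # All matrices are updated together, in the same step.
    for name, w in params.items():
        for i in range(len(w)):
            for j in range(len(w[0])):
                w[i][j] -= lr * grads[name][i][j]

rng = random.Random(0)
params = {name: [[rng.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(2)]
          for name in ("W_Q", "W_K", "W_V")}  # random initialization
x = [[1.0, 0.5], [-0.5, 1.0]]       # two made-up token embeddings
target = [[1.0, 0.0], [0.0, 1.0]]   # made-up training target

snapshot = {name: [row[:] for row in w] for name, w in params.items()}
loss_before = loss(params, x, target)
train_step(params, x, target)
loss_after = loss(params, x, target)
changed = {name: params[name] != snapshot[name] for name in params}
print(changed, loss_before, loss_after)
```

After the single step, `changed` reports whether each of W_Q, W_K and W_V moved away from its initial values. In a real model an autodiff framework computes the same kind of joint update for these matrices together with all the other layers.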

Concerning your second and third questions: we get different sets of W matrices as a result of the random element of initialization. Each head starts from different random values, so training generally moves the heads toward different solutions; in the standard setup nothing else explicitly forces them to differ. Using different Q, K, and V matrices allows the model to extract different kinds of semantic features from the text. One intuition is to think of each set of matrices as working in its own "meaning universe": each dimension within a matrix extracts a particular semantic feature, whereas each different set of matrices extracts features in a different meaning universe. This is also where concatenating heads differs from simply enlarging one head: a single larger head still computes only one set of attention weights per position, whereas each head computes its own. Empirically, this has been found to give a richer overall feature extraction process and better performance than using a single matrix concerned with a single meaning universe.
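As a rough illustration of heads versus one wider head, the sketch below compares two randomly initialized heads whose outputs are concatenated with one head of the full width. Everything concrete here is an assumption made for the example: the sizes (`d_model = 4`, two heads, three tokens), the random inputs, and the helper names. The output projection that normally follows the concatenation is left out, and the weights are untrained.

```python
import math
import random

def matmul(a, b):
    # Plain-Python matrix product of two lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def attention_head(x, w_q, w_k, w_v):
    # Returns (attention weights, head output) for one head.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(k[0]))
    scores = matmul(q, [list(col) for col in zip(*k)])  # Q K^T
    weights = [softmax([s / scale for s in row]) for row in scores]
    return weights, matmul(weights, v)

rng = random.Random(0)
d_model, num_heads, seq_len = 4, 2, 3
d_head = d_model // num_heads
x = rand_matrix(seq_len, d_model, rng)  # made-up token embeddings

# Multi-head: each head has its own randomly initialized (W_Q, W_K, W_V).
head_params = [[rand_matrix(d_model, d_head, rng) for _ in range(3)]
               for _ in range(num_heads)]
head_results = [attention_head(x, *p) for p in head_params]
patterns = [weights for weights, _ in head_results]
# Concatenate the head outputs token by token.
multi_out = [sum((out[t] for _, out in head_results), []) for t in range(seq_len)]

# Single wide head: same output width, but only one attention pattern.
wide_params = [rand_matrix(d_model, d_model, rng) for _ in range(3)]
wide_pattern, wide_out = attention_head(x, *wide_params)

print(len(multi_out[0]), len(wide_out[0]))  # both outputs have width d_model
print(patterns[0] != patterns[1])           # the two heads attend differently
```

Both variants produce a vector of width `d_model` per token. The difference is that the multi-head version mixes the tokens with `num_heads` separate softmax weight matrices (one per entry of `patterns`), while the wide head mixes them with a single one (`wide_pattern`).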
