Questions about Transformer W_Q, W_K and W_V

Hi all, I have 3 questions about how we calculate/train the parameter matrices W_Q, W_K and W_V.

  1. In a machine translation model, during training, are all the parameters, including W_Q, W_K, W_V, the other dense-layer parameters, and the softmax parameters, trained and updated at the SAME time, once we have finished translating the input and apply the cross-entropy loss? Or do we somehow pre-train W_Q, W_K and W_V (how?), use them to get Q, K, V, and then train the other network parameters?

  2. In the multi-head case, how do we get different sets of W matrices, so that we end up with different Q, K, V's? I mean, if we have found one set of Q, K, V that best represents the relations among the input words, why do we need several different Q, K, V's, and how do we train the model to make sure we actually get different sets of them?

  3. If we stack up the multi-head attentions by concatenation, how is that different from simply increasing the dimension of the attention vector?

Thanks for any help!

Hi haoyundeng,

With regard to your first question: yes, they are all updated at the same time; the gradients from the cross-entropy loss flow back through every layer, including the W_Q, W_K and W_V projections. Remember that these matrices are initialized randomly, just like the other weights.
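If it helps, here is a minimal PyTorch sketch (the layer names and dimensions are made up purely for illustration, not taken from any particular implementation) showing that the Q/K/V projections and the output layer all receive their gradients from the same cross-entropy loss in a single backward pass:

```python
import torch
import torch.nn as nn

# Hypothetical sizes, just for the sketch.
d_model, vocab_size, seq_len = 64, 1000, 10

# All of these are ordinary, randomly initialized linear layers.
W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)
out_proj = nn.Linear(d_model, vocab_size)

params = (list(W_q.parameters()) + list(W_k.parameters()) +
          list(W_v.parameters()) + list(out_proj.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-3)

x = torch.randn(1, seq_len, d_model)                 # dummy input embeddings
targets = torch.randint(0, vocab_size, (1, seq_len)) # dummy target tokens

# Forward pass: single-head scaled dot-product attention, then vocab logits.
Q, K, V = W_q(x), W_k(x), W_v(x)
attn = torch.softmax(Q @ K.transpose(-2, -1) / d_model ** 0.5, dim=-1)
logits = out_proj(attn @ V)

loss = nn.functional.cross_entropy(logits.view(-1, vocab_size),
                                   targets.view(-1))
loss.backward()    # gradients reach W_q, W_k, W_v AND out_proj together
optimizer.step()   # one step updates all of them at the same time
```

There is no separate pre-training stage for W_Q, W_K and W_V; they start random and are shaped entirely by the same loss as everything else.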

Concerning your second and third questions: we get different sets of W matrices as a result of the random element of their initialization; because each head starts from a different random point, the heads diverge during training rather than collapsing into copies of each other. Using different Q, K, and V matrices allows the model to extract different meaning feature structures from the text. One intuition is to think of each set of matrices as being concerned with a different meaning universe: each dimension of a matrix extracts a particular meaning feature, while each different matrix extracts features in a different meaning universe. Empirically, this has been found to allow for a richer overall meaning feature extraction process, resulting in better performance than a single large matrix concerned with a single meaning universe.
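To make the contrast in your third question concrete, here is a rough PyTorch sketch (again with made-up sizes, not anyone's reference implementation) of standard multi-head attention: the model dimension is split across heads, each head computes its own attention pattern from its own randomly initialized slice of W_Q/W_K/W_V, and the head outputs are concatenated and mixed. A single "wide" head of the same total dimension would instead produce only one attention pattern per query:

```python
import torch
import torch.nn as nn

# Hypothetical sizes, just for the sketch.
d_model, n_heads, seq_len = 64, 8, 10
d_head = d_model // n_heads

W_q = nn.Linear(d_model, d_model, bias=False)  # holds all heads' projections
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)
W_o = nn.Linear(d_model, d_model, bias=False)  # mixes the concatenated heads

x = torch.randn(1, seq_len, d_model)

def split_heads(t):
    # (batch, seq, d_model) -> (batch, n_heads, seq, d_head)
    return t.view(1, seq_len, n_heads, d_head).transpose(1, 2)

Q, K, V = split_heads(W_q(x)), split_heads(W_k(x)), split_heads(W_v(x))

# Each head computes its own softmax over its own d_head-sized dot products,
# so we get n_heads independent attention patterns per query position.
scores = Q @ K.transpose(-2, -1) / d_head ** 0.5
heads = torch.softmax(scores, dim=-1) @ V        # (1, n_heads, seq, d_head)

# Concatenation: put the heads back side by side, then mix them with W_o.
concat = heads.transpose(1, 2).reshape(1, seq_len, d_model)
out = W_o(concat)                                # (1, seq, d_model)
```

So concatenation is not simply a bigger attention vector: a single head of dimension d_model would compute one softmax, i.e. one attention pattern per query, whereas the multi-head version produces n_heads independent softmax-weighted summaries and then glues them together, which is where the extra expressiveness comes from.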