Hi all, I have 3 questions about how we calculate/train the parameter matrices W_Q, W_K and W_V.
1. In a machine translation model, during training, are all the parameters (W_Q, W_K and W_V, the other dense-layer parameters, and the softmax parameters) trained and updated at the SAME time, once we have finished translating the input and computed the cross-entropy loss? Or do we somehow pre-train W_Q, W_K and W_V (how?), use them to get Q, K, V, and then train the other network parameters?
2. In the multihead case, how do we get different sets of W matrices, and therefore different Q, K, V? I mean, if we have found one set of Q, K, V that best represents the relations among the input words, why do we need several different ones, and how do we train the model to make sure the sets actually come out different?
3. If we combine the multihead attention outputs by concatenation, how is that different from simply increasing the dimension of the attention vector?
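To make the last question concrete, here is a rough NumPy sketch of the two setups I am comparing. Everything in it is my own assumption for illustration: the sizes `n`, `d_model`, `h`, the random weights standing in for trained W_Q, W_K, W_V, and the helper names `softmax`, `attention` and `split`. It is not taken from any particular implementation, and it leaves out the final output projection.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention; works on (n, d) or batched (h, n, d) inputs.
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(Q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_model, h = 5, 8, 2           # assumed toy sizes: 5 tokens, width 8, 2 heads
d_head = d_model // h

X = rng.normal(size=(n, d_model))  # stand-in for the input embeddings
# Random matrices standing in for trained W_Q, W_K, W_V.
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# Setup A: one "wide" head over the full dimension.
single_out, single_w = attention(Q, K, V)

def split(M):
    # Cut the d_model columns into h heads: (n, d_model) -> (h, n, d_head).
    return M.reshape(n, h, d_head).transpose(1, 0, 2)

# Setup B: h heads attending independently, then concatenated.
head_out, head_w = attention(split(Q), split(K), split(V))
multi_out = head_out.transpose(1, 0, 2).reshape(n, d_model)

print(single_out.shape, multi_out.shape)  # same output shape in both setups
print(single_w.shape, head_w.shape)       # one weight matrix vs. one per head
```

In this sketch both setups use the same number of parameters and produce an output of the same shape, but the single head computes one n-by-n softmax over the tokens while the multihead version computes a separate one per head, so the outputs differ.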
Thanks for any help!