Questions about Transformer W_Q, W_K and W_V

Hi haoyundeng,

With regard to your first question: yes, they are updated at the same time. Remember that the matrices are initialized with a random component.

Concerning your second and third questions: we get different sets of W matrices as a result of the random element in initialization. Using distinct Q, K, and V matrices allows the model to extract different meaning-feature structures from the text. One intuition is to think of each matrix as operating in its own meaning universe: each dimension of a matrix is concerned with extracting a particular meaning feature, while each different matrix extracts features in a different meaning universe. Empirically, this has been found to allow a richer overall feature-extraction process, resulting in better performance than using a single matrix concerned with a single meaning universe.
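
To make the idea concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention. All dimensions and the random initialization are toy assumptions for illustration, not values from the thread; the point is that the same input X is projected through three separately initialized (and, in training, separately updated) matrices, giving three distinct "views" of the text.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_k = 8, 4   # toy dimensions, chosen for illustration
seq_len = 3

# W_Q, W_K, W_V start from different random draws and are updated
# independently by backprop; here we just sample them once.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

X = rng.normal(size=(seq_len, d_model))   # toy token embeddings

# Three distinct projections ("views") of the same input
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

scores = Q @ K.T / np.sqrt(d_k)           # scaled dot-product scores
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
output = weights @ V                       # shape (seq_len, d_k)
```

Because Q, K, and V come from different matrices, the similarity scores (Q against K) and the content that gets mixed (V) live in different learned subspaces, which is the "different meaning universes" intuition above.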