Questions about Transformer W_Q, W_K and W_V

Hi haoyundeng,

With regard to your first question: yes, they are updated at the same time. Remember that the matrices are initialized with a random component.

Concerning your second and third questions: we get different sets of W matrices as a result of the random element in initialization. Using distinct Q, K, and V matrices allows the model to extract different meaning-feature structures from the text. One intuition is to think of each matrix as operating in its own meaning universe: each dimension of a matrix is concerned with extracting a particular meaning feature, while each different matrix extracts features in a different meaning universe. Empirically, this has been found to allow a richer overall feature-extraction process, resulting in better performance than using a single matrix concerned with a single meaning universe.
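
To make the idea concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention. All dimensions and the random initialization are toy assumptions for illustration, not values from the thread; the point is that the same input X is projected through three separately initialized (and, in training, separately updated) matrices, giving three distinct "views" of the text.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_k = 8, 4   # toy dimensions, chosen for illustration
seq_len = 3

# W_Q, W_K, W_V start from different random draws and are updated
# independently by backprop; here we just sample them once.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

X = rng.normal(size=(seq_len, d_model))   # toy token embeddings

# Three distinct projections ("views") of the same input
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

scores = Q @ K.T / np.sqrt(d_k)           # scaled dot-product scores
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
output = weights @ V                       # shape (seq_len, d_k)
```

Because Q, K, and V come from different matrices, the similarity scores (Q against K) and the content that gets mixed (V) live in different learned subspaces, which is the "different meaning universes" intuition above.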