Transformer question: what is this v-value?

Helene_Hoffmann · May 3, 2022, 7:31am

I just saw the video about the self-attention mechanism of transformers. It was really helpful, but left me with one important question: why do we multiply the softmax result with the respective v-values? What is the goal of the operation, why does it work that way and not without it?

alvaroramajo · May 3, 2022, 9:39am

Hi, @Helene_Hoffmann!

To give you an intuition for why it is done that way, I’m going to explain it based on the original Transformer paper (Vaswani et al.). These transformer architectures have “attention functions”:

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

What does that mean? It means that for each input, three vectors are generated through three learnable parameter matrices (P_q, P_k, and P_v). The product of each of these matrices with the input gives the query, key, and value vectors. The compatibility function compares the query with each key (in the paper, a scaled dot product), and the softmax turns those scores into weights. The value vector plays an important role because it is what those weights are applied to: the output is the weighted sum of the value vectors, so without the values there would be only weights and no content to combine.
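To make the roles of the three vectors concrete, here is a minimal sketch in plain Python. The function names, the toy inputs, and the identity matrices used for P_q, P_k, and P_v are all made up for illustration (in a real model the matrices are learned), and the matrix-times-vector convention is an assumption. The division by sqrt(d_k) follows the scaled dot-product attention of the paper.

```python
import math

def matvec(matrix, vec):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(inputs, P_q, P_k, P_v):
    """Single-head self-attention over a list of input vectors."""
    # Each input is projected three times: query, key, and value.
    queries = [matvec(P_q, x) for x in inputs]
    keys = [matvec(P_k, x) for x in inputs]
    values = [matvec(P_v, x) for x in inputs]
    d_k = len(keys[0])
    dim_v = len(values[0])

    outputs = []
    for q in queries:
        # Compatibility of this query with every key (scaled dot product).
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        # Softmax turns the scores into weights that sum to 1.
        weights = softmax(scores)
        # The output is the weighted sum of the value vectors.
        out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim_v)]
        outputs.append(out)
    return outputs

# Hypothetical toy data: two 2-dimensional inputs, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
inputs = [[1.0, 0.0], [0.0, 1.0]]
outputs = self_attention(inputs, identity, identity, identity)
print(outputs)
```

With these toy values each output is a blend of the two value vectors, tilted toward the one whose key best matches the query. The weights only say how much of each value to take; the values supply what is taken.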

akkefa · May 4, 2022, 2:59pm

@Helene_Hoffmann Goal of the operation

Multiply Softmax Output with Value vector

The softmax output gives the attention weights. Multiply these weights by the value vectors and sum the results to get an output vector. Higher softmax scores keep the values of the words the model has learned are more important, while lower scores drown out the irrelevant words. Then you feed that output into a linear layer for further processing.
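A small numeric illustration of the "keep" and "drown out" effect, as a sketch only: the scores and value vectors below are invented, not taken from a trained model, and the helper names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(weights, values):
    """Combine value vectors using one weight per vector."""
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# Hypothetical value vectors for three words.
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]

# One score dominates: the output stays close to the first value vector,
# and the other two words are almost entirely drowned out.
peaked = weighted_sum(softmax([8.0, 0.0, 0.0]), values)

# Equal scores: every word counts the same, so the output is the plain average.
uniform = weighted_sum(softmax([0.0, 0.0, 0.0]), values)

print(peaked)
print(uniform)
```

The softmax result alone is just a list of numbers between 0 and 1. Multiplying by the value vectors is the step that turns "how much attention" into an actual vector that the next layer can use.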
