I just saw the video about the self-attention mechanism of transformers. It was really helpful, but left me with one important question: why do we multiply the softmax result with the respective v-values? What is the goal of the operation, why does it work that way and not without it?

Hi, @Helene_Hoffmann!

To give you an intuition of why it is done that way, I'm going to explain it based on the original Transformer paper (Vaswani et al.). This transformer architecture has "attention functions":

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key

What does that mean? It means that for each input, three vectors are generated through three learnable parameter matrices (P_q, P_k, and P_v). The dot-product between the input and each of these matrices gives the query, key, and value vectors. The compatibility function is computed from the query and the key (their dot product, passed through a softmax), and the value vector is what that compatibility score is applied to: it carries the actual content that gets summed into the output.
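As a minimal NumPy sketch of that projection step (toy sizes, random matrices standing in for the learned P_q, P_k, P_v):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_k = 8, 4                      # toy model width and head width
x = rng.standard_normal((5, d_model))    # 5 input token embeddings

# Learnable projection matrices (random stand-ins here; in a real model
# these are trained parameters)
P_q = rng.standard_normal((d_model, d_k))
P_k = rng.standard_normal((d_model, d_k))
P_v = rng.standard_normal((d_model, d_k))

Q = x @ P_q   # queries, shape (5, d_k)
K = x @ P_k   # keys,    shape (5, d_k)
V = x @ P_v   # values,  shape (5, d_k)
```

Each input token thus gets its own query, key, and value vector, all derived from the same embedding through different learned projections.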

@Helene_Hoffmann **Goal of the operation**

## Multiply Softmax Output with Value vector

You take the attention weights and multiply them by the value vectors to get an output vector. The higher softmax scores preserve the values of the words the model has learned are important, while the lower scores drown out the irrelevant words. Then you feed that output into a linear layer for further processing.
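Putting it together, here is a small NumPy sketch of scaled dot-product attention showing exactly where the softmax output multiplies the value vectors (toy sizes, random inputs; function names are mine, not from the paper):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # query-key compatibility scores
    weights = softmax(scores, axis=-1) # each row sums to 1
    # This is the step in question: the softmax weights scale the value
    # vectors, so each output is a weighted sum of the values.
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((5, 4))
K = rng.standard_normal((5, 4))
V = rng.standard_normal((5, 4))
out, w = attention(Q, K, V)
```

Without the multiplication by V, the softmax output would only tell you *how much* each token should attend to the others; multiplying by the values is what actually mixes the tokens' content into the output in those proportions.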