Hi, @Helene_Hoffmann!

To give you an intuition of why it is done that way, I’m going to explain it based on the original Transformer paper (Vaswani et al., 2017). The Transformer architecture is built around “attention functions”:

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

What does that mean? It means that for each input, three vectors are generated through three learnable parameter matrices (P_q, P_k, and P_v). The dot product between each of these matrices and the input gives the query, key, and value vectors, respectively. The weight assigned to each value vector is then computed by the compatibility function of the query (x_{train}) with the corresponding key (y_{train}) at each training step, and the output is the weighted sum of the values.
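The mechanism above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper’s full multi-head implementation: the projection matrices (named `P_q`, `P_k`, `P_v` to match the explanation) are initialized randomly here instead of being learned, and the dimensions are made up for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8

# One input vector per token
X = rng.normal(size=(seq_len, d_model))

# Learnable projection matrices (P_q, P_k, P_v in the explanation above);
# randomly initialized here, but trained by backprop in a real model
P_q = rng.normal(size=(d_model, d_k))
P_k = rng.normal(size=(d_model, d_k))
P_v = rng.normal(size=(d_model, d_k))

Q = X @ P_q  # query vectors
K = X @ P_k  # key vectors
V = X @ P_v  # value vectors

# Compatibility function: scaled dot product of each query with each key,
# turned into weights with a softmax
weights = softmax(Q @ K.T / np.sqrt(d_k))

# Output: weighted sum of the value vectors
output = weights @ V

print(weights.shape, output.shape)  # (4, 4) (4, 8)
```

Each row of `weights` sums to 1, so each output vector is a convex combination of the value vectors, weighted by how compatible that position’s query is with every key.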