Training of transformer

Mattias_Frenne · October 27, 2021, 7:03am

Hi
I have a fundamental question about my understanding of the transformer network. Andrew describes Q as queries, like “Who is…”. But I assume this is only for intuition: training the transformer finds the matrix W_Q that is used to map the input x to the Q matrix.

So for a trained transformer, W_Q is frozen, and a human cannot readily interpret what it encodes.

So my questions are:

1. For multi-headed attention, how do we prevent each head from converging to the same weights? The intuition is that different heads represent different questions, but how is this richness ensured? In principle, could all heads converge to the same attention?
2. I assume that after training the weights are fixed, but then how can the model generalize to a completely different sentence? A different sentence should produce another set of A’s, but W_Q, W_K and W_V are fixed, right? Some intuition here would be nice…

br
Mattias

jonaslalin · October 27, 2021, 10:39am

Hello!

By initializing the weights randomly for each head, you break symmetry, so the heads can learn different things and you benefit from having several of them. It is similar to the symmetry-breaking problem that Prof. Andrew Ng talks about in the course and that Paul also writes about in the FAQ.

More recent studies show that it is possible to prune heads and keep only the most important subset.
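
Returning to the symmetry-breaking point, here is a minimal sketch of why identical initialization is a problem. It is not how a real transformer is trained: the helper names (`attention_head`, `grad_w_q`, `rand_head`) are made up for illustration, the two head outputs are simply summed instead of being concatenated and passed through an output projection, and the gradient is estimated with finite differences rather than backpropagation. Under those assumptions, two heads that start with the same weights receive the same gradient for W_Q, so gradient descent would keep them identical, while two randomly initialized heads receive different gradients.

```python
import copy
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention for one head."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(q[0]))
    weights = [softmax([sum(a * b for a, b in zip(qr, kr)) / scale for kr in k]) for qr in q]
    return matmul(weights, v)


def loss(x, heads, target):
    """Squared error of the summed head outputs (assumed simplification of concat + W_O)."""
    outs = [attention_head(x, *h) for h in heads]
    return sum(
        (sum(o[t][j] for o in outs) - target[t][j]) ** 2
        for t in range(len(target))
        for j in range(len(target[0]))
    )


def grad_w_q(x, heads, target, h, eps=1e-5):
    """Central finite-difference gradient of the loss w.r.t. W_Q of head h."""
    w_q = heads[h][0]
    grad = [[0.0] * len(w_q[0]) for _ in w_q]
    for i in range(len(w_q)):
        for j in range(len(w_q[0])):
            old = w_q[i][j]
            w_q[i][j] = old + eps
            up = loss(x, heads, target)
            w_q[i][j] = old - eps
            down = loss(x, heads, target)
            w_q[i][j] = old  # restore the original weight
            grad[i][j] = (up - down) / (2 * eps)
    return grad


def max_abs_diff(a, b):
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


rng = random.Random(0)
d_model, d_k, seq_len = 4, 3, 3


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def rand_head():
    return [rand_matrix(d_model, d_k) for _ in range(3)]  # W_Q, W_K, W_V


x = rand_matrix(seq_len, d_model)   # random stand-in for token embeddings
target = rand_matrix(seq_len, d_k)  # random stand-in for a training target

first = rand_head()
identical = [first, copy.deepcopy(first)]  # both heads start with the same weights
random_init = [rand_head(), rand_head()]   # each head gets its own random weights

diff_identical = max_abs_diff(grad_w_q(x, identical, target, 0), grad_w_q(x, identical, target, 1))
diff_random = max_abs_diff(grad_w_q(x, random_init, target, 0), grad_w_q(x, random_init, target, 1))
print("identical init, gradient gap between heads:", diff_identical)
print("random init, gradient gap between heads:", diff_random)
```

Note that random initialization only makes it possible for heads to diverge; it does not guarantee that they end up doing different jobs, which fits with the observation that some heads can be pruned.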

Your other question is the same as asking how any neural network can generalize to new data. For example, suppose one head learns to pick out the subject in a sentence, and another head’s weights find the verb candidates of a sentence. Then, in later layers, the model might use the subject to pick the correct verb (she plays, we play, etc.). Sure enough, this will generalize to other sentences as well, right? However, if the training data only contains present tense verbs, the model will struggle to generalize to different verb tenses. Hence, you need high-quality training data to be able to generalize well.
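
On the "fixed weights, different A’s" part of the question, a small sketch may help. It assumes a single head, random matrices standing in for trained W_Q and W_K, and random vectors standing in for word embeddings (all names are hypothetical). The point it tries to show is that the attention weights are computed from the input, Q = x W_Q and K = x W_K, so the same frozen matrices give a different attention pattern for every sentence, even for sentences of different lengths.

```python
import copy
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(x, w_q, w_k):
    """Return the attention matrix A = softmax(Q K^T / sqrt(d_k)) for input x."""
    q, k = matmul(x, w_q), matmul(x, w_k)
    scale = math.sqrt(len(q[0]))
    return [softmax([sum(a * b for a, b in zip(qr, kr)) / scale for kr in k]) for qr in q]


rng = random.Random(0)
d_model, d_k = 4, 3


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


# "Trained" weights: random stand-ins here, frozen from now on.
w_q = rand_matrix(d_model, d_k)
w_k = rand_matrix(d_model, d_k)
w_q_before = copy.deepcopy(w_q)

# Two different "sentences": 3 tokens and 5 tokens (random stand-in embeddings).
sentence_a = rand_matrix(3, d_model)
sentence_b = rand_matrix(5, d_model)

attn_a = attention_weights(sentence_a, w_q, w_k)  # 3 x 3 attention matrix
attn_b = attention_weights(sentence_b, w_q, w_k)  # 5 x 5 attention matrix

for name, attn in (("sentence A", attn_a), ("sentence B", attn_b)):
    print(name)
    for row in attn:
        print("  ", [round(w, 3) for w in row])
```

The weights define a fixed rule for comparing tokens; the A’s are the result of applying that rule to whatever tokens are present.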
