I have a fundamental question about my understanding of the transformer network. Andrew describes Q as queries, like “Who is…?” But I assume this is only for intuition; during training, the transformer learns the matrix W_Q that maps the input x to the Q matrix.
So for a trained transformer, W_Q is frozen, and a human cannot readily interpret what it represents.
So my questions are:
For multi-headed attention, how do we prevent each head from converging to the same weights? The intuition is that different heads represent different questions, but how is this richness ensured? In principle, couldn’t all heads converge to the same attention pattern?
I assume that after training the weights are fixed, but then how can the model generalize to a completely different sentence? A different sentence should produce a different set of A’s, yet W_Q, W_K and W_V are fixed, right? Some intuition here would be nice…
By initializing the weights randomly for each head, you break the symmetry, and you benefit from having multiple heads. It is similar to the symmetry-breaking problem Prof Andrew Ng talks about in the course, which Paul also writes about in the FAQ:
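A minimal NumPy sketch of this symmetry-breaking idea (the dimensions, initialization scale, and function names here are illustrative assumptions, not the actual course code): each head gets its own randomly initialized W_Q, W_K, W_V, so even before any training the heads already produce different attention maps, and gradient descent keeps them on different trajectories.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, n_heads, n_tokens = 8, 4, 2, 5

# Each head gets its own randomly initialized projection matrices.
# Random initialization breaks symmetry: heads start out different,
# so their gradients differ and they do not collapse onto each other.
heads = [
    {name: rng.normal(0.0, 0.1, (d_model, d_head))
     for name in ("W_Q", "W_K", "W_V")}
    for _ in range(n_heads)
]

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def head_attention_weights(x, params):
    # A = softmax(Q K^T / sqrt(d_head)) -- one head's attention pattern.
    Q = x @ params["W_Q"]
    K = x @ params["W_K"]
    return softmax(Q @ K.T / np.sqrt(d_head))

x = rng.normal(size=(n_tokens, d_model))   # stand-in token embeddings
A0 = head_attention_weights(x, heads[0])
A1 = head_attention_weights(x, heads[1])

# With identical initialization the two maps would be identical;
# with random initialization they differ from step 0.
print(np.allclose(A0, A1))  # False
```

If all heads were initialized with the same weights, they would receive identical gradients and stay identical forever, which is exactly the degenerate case the question worries about.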
More recent studies show that it is possible to prune heads and keep only the most important subset:
Your other question is really the question of how any neural network generalizes to new data. For example, suppose one head learns to pick out the subject of a sentence, and another head’s weights find the verb candidates. Then, in later layers, the model might use the subject to pick the correct verb form (she plays, we play, etc.). That mechanism generalizes to other sentences as well. However, if the training data contains only present-tense verbs, the model will struggle to generalize to other tenses. Hence, you need high-quality, diverse training data to generalize well.
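To make the “fixed weights, new sentence” point concrete, here is a small sketch (again with made-up dimensions and random matrices standing in for trained weights): W_Q and W_K are frozen, but because the attention weights A depend on the input x through Q = x·W_Q and K = x·W_K, a different sentence produces a different A even though no weight changes.

```python
import numpy as np

rng = np.random.default_rng(42)
d_model, d_head, n_tokens = 8, 4, 4

# "Trained" projections -- fixed random stand-ins, frozen after training.
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(x):
    # The weights are fixed; A varies because Q and K depend on x.
    Q, K = x @ W_Q, x @ W_K
    return softmax(Q @ K.T / np.sqrt(d_head))

# Two different "sentences" (stand-in embeddings, 4 tokens each).
sentence1 = rng.normal(size=(n_tokens, d_model))
sentence2 = rng.normal(size=(n_tokens, d_model))

A1 = attention_weights(sentence1)
A2 = attention_weights(sentence2)
# Same frozen W_Q, W_K -> different attention patterns for different inputs.
```

So the frozen matrices encode *what kind of relationship* a head looks for; which tokens actually end up attending to which is recomputed from scratch for every new input.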