Training of transformer

jonaslalin · October 27, 2021, 10:39am

Hello!

By initializing the weights randomly for each head, you break symmetry, and that is what lets you benefit from having multiple heads. If every head started from identical weights, they would all receive identical gradients and stay copies of each other. It is the same symmetry-breaking problem that Prof. Andrew Ng talks about in the course and that Paul also writes about in the FAQ.
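To make the symmetry argument concrete, here is a minimal NumPy sketch. Everything in it is my own illustration, not code from the course: the names (`head`, `loss`, `grad_wq`), the tiny dimensions, and the toy sum-of-squares loss are all assumptions, and the gradients are estimated with finite differences rather than backprop. It compares the gradient of the query weights for two heads, once with identical initial weights and once with random ones.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def head(x, wq, wk, wv):
    """One scaled dot-product self-attention head."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

def loss(x, heads, wo):
    """Toy loss (an assumption): concatenate heads, project, sum of squares."""
    concat = np.concatenate([head(x, *h) for h in heads], axis=-1)
    return float(np.sum((concat @ wo) ** 2))

def grad_wq(x, heads, wo, i, eps=1e-5):
    """Central finite-difference gradient w.r.t. the query weights of head i."""
    wq = heads[i][0]
    g = np.zeros_like(wq)
    for idx in np.ndindex(*wq.shape):
        old = wq[idx]
        wq[idx] = old + eps
        up = loss(x, heads, wo)
        wq[idx] = old - eps
        down = loss(x, heads, wo)
        wq[idx] = old  # restore the original value
        g[idx] = (up - down) / (2 * eps)
    return g

rng = np.random.default_rng(0)
d_model, d_head, seq_len = 4, 2, 3
x = rng.normal(size=(seq_len, d_model))

def random_head():
    # (W_q, W_k, W_v) for a single head
    return tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))

# Case 1: both heads start from the same weights and the same output block.
base = random_head()
same_heads = [tuple(w.copy() for w in base) for _ in range(2)]
wo_block = rng.normal(size=(d_head, d_model))
wo_same = np.concatenate([wo_block, wo_block], axis=0)
same_grads_equal = np.allclose(
    grad_wq(x, same_heads, wo_same, 0),
    grad_wq(x, same_heads, wo_same, 1),
    atol=1e-6,
)

# Case 2: every head (and the output projection) is initialized randomly.
rand_heads = [random_head(), random_head()]
wo_rand = rng.normal(size=(2 * d_head, d_model))
rand_grads_equal = np.allclose(
    grad_wq(x, rand_heads, wo_rand, 0),
    grad_wq(x, rand_heads, wo_rand, 1),
    atol=1e-6,
)

print("identical init, gradients match:", same_grads_equal)
print("random init, gradients match:", rand_grads_equal)
```

With identical initialization, the two heads get the same update, so a gradient step leaves them identical and the second head adds nothing. With random initialization, the updates differ from the first step, so the heads are free to specialize.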

More recent studies show that it is possible to prune heads and keep only the most important subset.
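As a rough picture of what pruning means mechanically, here is a small sketch that masks out all but the top-scoring heads. The function name `keep_top_heads` and the importance scores are made up for illustration. How to actually measure a head's importance is the hard part that those studies address, and this sketch does not attempt it.

```python
import numpy as np

def keep_top_heads(head_outputs, importance, keep):
    """Zero out all but the `keep` highest-scoring heads.

    head_outputs: array of shape (num_heads, seq_len, d_head)
    importance:   one hypothetical score per head, higher means more important
    """
    importance = np.asarray(importance, dtype=float)
    if not 0 < keep <= len(importance):
        raise ValueError("keep must be between 1 and the number of heads")
    kept = np.sort(np.argsort(importance)[-keep:])  # indices of kept heads
    mask = np.zeros(len(importance))
    mask[kept] = 1.0
    # Broadcast the per-head mask over the sequence and feature axes.
    return head_outputs * mask[:, None, None], kept

outputs = np.ones((4, 3, 2))        # 4 heads, 3 tokens, 2 dims per head
scores = [0.9, 0.1, 0.5, 0.05]      # made-up importance scores
pruned, kept = keep_top_heads(outputs, scores, keep=2)
print("kept heads:", kept.tolist())
```

Masking like this only simulates pruning. To save compute, you would remove the corresponding weight slices instead of multiplying by zero.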

Your other question comes down to how any neural network generalizes to new data. For example, suppose one head learns to pick out the subject of a sentence, and another head's weights find the verb candidates. Then, in later layers, the model might use the subject to pick the correct verb form (she plays, we play, etc.). Sure enough, this will generalize to other sentences as well, right? However, if the training data only contains present-tense verbs, the model will struggle to generalize to other tenses. Hence, you need high-quality training data to be able to generalize well.
