Hello!
By initializing the weights randomly for each head, you break symmetry, and that is what lets you benefit from having multiple heads: heads that started with identical weights would compute identical outputs. It is the same symmetry-breaking problem that Prof. Andrew Ng discusses in the course and that Paul also writes about in the FAQ.
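To make the symmetry point concrete, here is a minimal sketch in plain Python (no framework, so it is not how the course assignments implement attention). It runs the same toy input through two attention heads, once with identical weights and once with independently drawn random weights. All names, sizes, and the seed are made up for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention for a single head."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def max_abs_diff(a, b):
    """Largest element-wise difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


rng = random.Random(0)  # arbitrary seed, only for reproducibility
d_model, d_head, seq_len = 4, 2, 3  # toy sizes, assumed for illustration
x = random_matrix(seq_len, d_model, rng)

# Case 1: both heads start from the SAME weights (no symmetry breaking).
shared = [random_matrix(d_model, d_head, rng) for _ in range(3)]
same_init_diff = max_abs_diff(attention_head(x, *shared), attention_head(x, *shared))

# Case 2: each head draws its OWN random weights.
params_a = [random_matrix(d_model, d_head, rng) for _ in range(3)]
params_b = [random_matrix(d_model, d_head, rng) for _ in range(3)]
random_init_diff = max_abs_diff(attention_head(x, *params_a), attention_head(x, *params_b))

print("identical init, max difference between heads:", same_init_diff)
print("random init, max difference between heads:", random_init_diff)
```

With identical weights the two heads produce exactly the same output, so the second head adds nothing. With separate random weights the outputs differ from the start, which gives training something to specialize.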
More recent studies show that it is possible to prune heads after training and keep only the most important subset.
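As a rough illustration of what pruning heads means mechanically, the sketch below keeps the top-scoring heads and zeroes out the rest. The importance scores and head outputs are invented numbers, and `prune_heads` is a hypothetical helper; the studies in question define their own importance measures, which are not reproduced here.

```python
def prune_heads(head_outputs, importance, keep):
    """Keep the `keep` heads with the highest importance; zero out the others."""
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    kept = sorted(ranked[:keep])
    pruned = [
        list(out) if i in kept else [0.0] * len(out)
        for i, out in enumerate(head_outputs)
    ]
    return kept, pruned


# Hypothetical per-head outputs for one token (4 heads, 2 values each).
head_outputs = [[0.2, -0.1], [0.9, 0.4], [0.0, 0.3], [-0.5, 0.7]]
# Hypothetical importance scores, one per head.
importance = [0.05, 0.60, 0.10, 0.25]

kept, pruned = prune_heads(head_outputs, importance, keep=2)
print("kept heads:", kept)  # → kept heads: [1, 3]
print("outputs after pruning:", pruned)
```

Zeroing a head's output is the simplest way to simulate removing it; an actual pruned model would drop the corresponding weights entirely to save computation.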
Your other question comes down to how any neural network generalizes to new data. For example, suppose one head learns to pick out the subject of a sentence, and another head's weights find the verb candidates. Then, in later layers, the model might use the subject to pick the correct verb form (she plays, we play, etc.). Sure enough, this will generalize to other sentences as well, right? However, if the training data contains only present-tense verbs, the model will struggle to generalize to other verb tenses. Hence, you need high-quality training data that covers the variation you care about to be able to generalize well.