Multi-Headed Attention: How Do Heads Differentiate?

Hello! In the Transformer model's multi-headed attention mechanism, how do we ensure that the multiple heads learn different features rather than accidentally learning the same feature?

I'm going to guess it's because each head's weights are randomly initialized, so the heads start from different points and their gradients push them in different directions. This is similar to how random initialization breaks symmetry between neurons in any neural network.
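To make the guess concrete, here is a minimal numpy sketch (my own illustration, with made-up toy dimensions): each head gets its own randomly initialized query/key projections, so even before any training the heads produce different attention patterns, and training then amplifies those differences.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_head, n_heads, seq_len = 16, 4, 2, 5
x = rng.normal(size=(seq_len, d_model))  # toy token embeddings

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def head_attention(x, wq, wk):
    # Scaled dot-product attention map for one head
    scores = (x @ wq) @ (x @ wk).T / np.sqrt(d_head)
    return softmax(scores)  # shape (seq_len, seq_len)

# Independent random init per head -- this is the symmetry breaker.
# If both heads shared identical weights, they would also receive
# identical gradients and stay identical forever.
heads = [(rng.normal(size=(d_model, d_head)),
          rng.normal(size=(d_model, d_head))) for _ in range(n_heads)]

maps = [head_attention(x, wq, wk) for wq, wk in heads]
print(np.abs(maps[0] - maps[1]).max())  # nonzero: heads already attend differently
```

Note this only shows that heads *start* different; nothing explicitly forces them to stay non-redundant, which is part of what the question is asking about.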