BERT pretraining

Could someone explain the intuition behind replacing the tokens chosen for masking with either [MASK], a random token, or the original token during pretraining? Why are all three used?
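For reference, the corruption step I'm asking about works roughly like this (a minimal sketch of the recipe from the BERT paper, Devlin et al. 2018; the tiny vocabulary and the `corrupt` / `select_prob` names are just for illustration):

```python
import random

# Sketch of BERT's masked-LM corruption step: 15% of tokens are selected
# for prediction; each selected token is then replaced with [MASK] 80% of
# the time, with a random vocabulary token 10% of the time, and left
# unchanged the remaining 10%. The vocab below is illustrative only.

VOCAB = ["the", "cat", "sat", "on", "mat"]

def corrupt(tokens, select_prob=0.15):
    """Return (corrupted tokens, indices the model must predict)."""
    out, targets = list(tokens), []
    for i in range(len(tokens)):
        if random.random() >= select_prob:
            continue                # token not selected for prediction
        targets.append(i)           # loss is computed at this position
        r = random.random()
        if r < 0.8:                 # 80%: replace with [MASK]
            out[i] = "[MASK]"
        elif r < 0.9:               # 10%: replace with a random token
            out[i] = random.choice(VOCAB)
        # else 10%: keep the original token unchanged
    return out, targets

print(corrupt(["the", "cat", "sat", "on", "the", "mat"]))
```

Note that the model is asked to predict the original token at every selected position, including the 10% that were left unchanged.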

Hi @blackdragon,

Here is my previous attempt at explaining this.

Cheers