Could someone explain the intuition behind replacing the tokens chosen for prediction with either [MASK], a random token, or the original token in the pretraining step? Why are all three used?
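
For concreteness, here is a minimal Python sketch of the procedure being asked about. It assumes this is BERT-style masked language modeling with the usual split: roughly 15% of tokens are selected, and of those 80% become [MASK], 10% become a random token, and 10% are left unchanged. The function name `mask_tokens`, its parameters, and the toy vocabulary are made up for illustration and do not come from any library.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, select_prob=0.15, rng=None):
    """Return (corrupted_tokens, targets) for one sequence.

    targets maps each selected position to its original token, which is
    what the model is trained to predict at that position.
    """
    rng = rng or random.Random()
    corrupted = list(tokens)
    targets = {}
    for i, tok in enumerate(tokens):
        # Skip positions that are not selected for prediction.
        if rng.random() >= select_prob:
            continue
        targets[i] = tok
        r = rng.random()
        if r < 0.8:
            # 80% of selected positions: replace with the mask token.
            corrupted[i] = MASK
        elif r < 0.9:
            # 10%: replace with a random vocabulary token
            # (which may happen to equal the original).
            corrupted[i] = rng.choice(vocab)
        # Remaining 10%: leave the original token in place.
    return corrupted, targets


tokens = ["the", "cat", "sat", "on", "the", "mat"]
vocab = ["dog", "ran", "tree", "blue"]
corrupted, targets = mask_tokens(tokens, vocab, rng=random.Random(0))
print(corrupted, targets)
```

In all three cases the position is recorded in `targets`, so the model is asked to predict the original token whether or not the input at that position was actually changed.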