Hi @Ritu_Pande
This is a good question. Here is the part of the paper that explains it:
I can try to explain it in simpler words:
- 80% of the time the model receives the input [“my”, “dog”, “is”, [MASK]] and has to correctly predict the masked word, which could be “hairy”; if the model assigns a high probability to this word, the loss is low;
- 10% of the time the model receives the input [“my”, “dog”, “is”, “apple”] and still has to correctly predict “hairy” in that position (the position where “apple” now sits). That means the model has to rely on the context words “my dog is” and not be fooled by the word “apple”. This is somewhat similar to data augmentation in vision: throwing in a small amount of noise (10% of the 15% selected tokens, i.e. about 1.5% of all tokens) for robustness;
- 10% of the time the model receives the input [“my”, “dog”, “is”, “hairy”] and has to correctly predict “hairy” in that position, even though it is left unchanged. This might be tricky to understand, but it helps the model adjust the embedding of “hairy” when it “sees” the actual word, whereas in the first (80%) case the token at that position is [MASK]. The short sketch below puts the three cases together.
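To make the three cases concrete, here is a minimal sketch of the 80/10/10 rule in plain Python. The toy vocabulary and the `-100` “ignore this position in the loss” label are my own illustration choices, not something prescribed by the paper; only the selection probabilities follow the description above:

```python
# Minimal sketch of BERT-style 80/10/10 masking (illustrative only).
import random

VOCAB = ["[PAD]", "[MASK]", "my", "dog", "is", "hairy", "apple", "cat", "runs"]
MASK_ID = VOCAB.index("[MASK]")

def mask_tokens(token_ids, mask_prob=0.15):
    """Return (inputs, labels). Labels hold the ORIGINAL token id for the
    selected positions and -100 (ignored by the loss) everywhere else."""
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:      # ~15% of tokens are selected for prediction
            labels[i] = tok                   # the model must always predict the original token
            r = random.random()
            if r < 0.8:                       # 80% of selected: replace with [MASK]
                inputs[i] = MASK_ID
            elif r < 0.9:                     # 10% of selected: replace with a random token
                inputs[i] = random.randrange(len(VOCAB))
            # remaining 10% of selected: keep the original token unchanged
    return inputs, labels

sentence = [VOCAB.index(w) for w in ["my", "dog", "is", "hairy"]]
print(mask_tokens(sentence))
```

Note that in all three branches the label is the original word (“hairy”), so the loss always pushes the model to recover it from the surrounding context.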
I hope that makes sense.
Cheers