I am unable to fully understand the masking procedure. We choose 15% of the input tokens to be replaced by the [MASK] token. Out of those, 10% are instead replaced by random tokens and 10% are kept as is.
It is said that this is done so that the model does not learn only to predict masked tokens.
However, when the objective is calculated, if the loss is computed only for the masked tokens, how can the randomly replaced tokens and the tokens left as is play any role?
Hi @Ritu_Pande
This is a good question. Here is the part in the paper explaining it:
I can try to explain it in simpler words:
- 80% of the time the model receives the input [“my”, “dog”, “is”, [MASK]] and it has to correctly predict the masked word, which here is “hairy”; if the model assigns high probability to this word, the loss is low;
- 10% of the time the model receives the input [“my”, “dog”, “is”, “apple”] and it still has to correctly predict “hairy” in that position (the position of “apple”). That means the model has to learn from the context words “my dog is” and not be fooled by the word “apple”. This is somewhat similar to data augmentation in vision: throwing in a small amount of noise (only 1.5% of all tokens) for robustness.
- 10% of the time the model receives the input [“my”, “dog”, “is”, “hairy”] and it has to correctly predict the word “hairy” in that position. This might be tricky to understand, but it helps the model adjust the embedding of “hairy” when it already “sees” the actual word in the input
(whereas in the first case, the 80% case, the token at that position is [MASK]); the sketch below puts these three cases together.
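Here is a minimal sketch of the 80/10/10 masking step, assuming PyTorch and a generic `tokenizer` object that exposes `mask_token_id` and `vocab_size` (the helper name `mask_tokens` and its exact signature are illustrative, not a specific library API):

```python
import torch

def mask_tokens(input_ids, tokenizer, mlm_probability=0.15):
    """Return (masked_inputs, labels) for masked language modeling."""
    labels = input_ids.clone()
    input_ids = input_ids.clone()

    # Sample 15% of positions as prediction targets.
    selected = torch.bernoulli(torch.full(labels.shape, mlm_probability)).bool()

    # Positions that were not selected are ignored by the loss
    # (-100 is the default ignore_index of cross-entropy in PyTorch).
    labels[~selected] = -100

    # 80% of the selected positions -> [MASK]
    replaced_with_mask = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & selected
    input_ids[replaced_with_mask] = tokenizer.mask_token_id

    # 10% of the selected positions -> a random token
    # (half of the remaining 20%, hence the 0.5 below).
    replaced_with_random = (
        torch.bernoulli(torch.full(labels.shape, 0.5)).bool()
        & selected
        & ~replaced_with_mask
    )
    random_tokens = torch.randint(tokenizer.vocab_size, labels.shape, dtype=input_ids.dtype)
    input_ids[replaced_with_random] = random_tokens[replaced_with_random]

    # The remaining 10% of the selected positions keep their original token,
    # but their labels are still set, so they still contribute to the loss.
    return input_ids, labels
```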
I hope that makes sense 
Cheers
So what this means is that the cross-entropy loss is calculated not just for the [MASK] tokens but for all 15% of the tokens sampled for masking, i.e. it includes the tokens replaced randomly and those left as is. Am I correct?
Yes @Ritu_Pande, you are correct
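As a small follow-up sketch (continuing the hypothetical `mask_tokens` helper above, and assuming a `model` call that returns raw logits): positions outside the 15% selection carry the label -100, which PyTorch's cross-entropy ignores by default, so the loss covers exactly the masked, randomly replaced, and kept-as-is tokens.

```python
import torch.nn.functional as F

# Hypothetical usage: `model` returns logits of shape (batch, seq_len, vocab_size).
masked_inputs, labels = mask_tokens(input_ids, tokenizer)
logits = model(masked_inputs)

# Flatten batch and sequence dimensions; label -100 is the default
# ignore_index, so only the 15% selected positions contribute to the loss.
loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
```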