Hi @Ritu_Pande
This is a good question. Here is the part of the paper that explains it:
I can try to explain it in simpler words:
- 80% of the time the model receives the input [“my”, “dog”, “is”, [MASK]] and has to correctly predict the masked word, which could be “hairy”; if the model assigns a high probability to this word, the loss is low;
- 10% of the time the model receives the input [“my”, “dog”, “is”, “apple”] and still has to correctly predict “hairy” in that position (the position where “apple” now sits). That means the model has to rely on the context words “my dog is” and not be fooled by the word “apple”. This is somewhat similar to data augmentation in vision: throwing in a small amount of noise (10% of the 15% selected tokens, i.e. about 1.5% of all tokens) for robustness;
- 10% of the time the model receives the input [“my”, “dog”, “is”, “hairy”] and has to correctly predict “hairy” in that position, even though it is left unchanged. This might be tricky to understand, but it helps the model adjust the embedding of “hairy” when it “sees” the actual word, whereas in the first (80%) case the token at that position is [MASK]. The short sketch below puts the three cases together.
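To make the three cases concrete, here is a minimal sketch of the 80/10/10 rule in plain Python. The toy vocabulary and the `-100` “ignore this position in the loss” label are my own illustration choices, not something prescribed by the paper; only the selection probabilities follow the description above:

```python
# Minimal sketch of BERT-style 80/10/10 masking (illustrative only).
import random

VOCAB = ["[PAD]", "[MASK]", "my", "dog", "is", "hairy", "apple", "cat", "runs"]
MASK_ID = VOCAB.index("[MASK]")

def mask_tokens(token_ids, mask_prob=0.15):
    """Return (inputs, labels). Labels hold the ORIGINAL token id for the
    selected positions and -100 (ignored by the loss) everywhere else."""
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:      # ~15% of tokens are selected for prediction
            labels[i] = tok                   # the model must always predict the original token
            r = random.random()
            if r < 0.8:                       # 80% of selected: replace with [MASK]
                inputs[i] = MASK_ID
            elif r < 0.9:                     # 10% of selected: replace with a random token
                inputs[i] = random.randrange(len(VOCAB))
            # remaining 10% of selected: keep the original token unchanged
    return inputs, labels

sentence = [VOCAB.index(w) for w in ["my", "dog", "is", "hairy"]]
print(mask_tokens(sentence))
```

Note that in all three branches the label is the original word (“hairy”), so the loss always pushes the model to recover it from the surrounding context.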
I hope that makes sense.
Cheers