I am unable to fully understand the masking procedure. We choose 15% of the input tokens to be replaced by the [MASK] token. Out of those, 10% are instead replaced by random tokens and 10% are kept as is.
It is said that this is done so that the model does not learn only to predict masked tokens.
However, when the objective is calculated, if the loss is computed only for the masked tokens, how can the randomly replaced tokens and the tokens left as is play any role?
Hi @Ritu_Pande
This is a good question. Here is the part in the paper explaining it:
I can try to explain it in simpler words:
- 80% of the time the model receives the input [“my”, “dog”, “is”, [MASK]] and it has to correctly predict the masked word, which here is “hairy”; if the model assigns high probability to this word, the loss is low;
- 10% of the time the model receives the input [“my”, “dog”, “is”, “apple”] and it still has to correctly predict “hairy” in that position (the position of “apple”). That means the model has to learn from the context words “my dog is” and not be fooled by the word “apple”. This is somewhat similar to data augmentation in vision: throwing in a small amount of noise (only 1.5% of all tokens) for robustness.
- 10% of the time the model receives the input [“my”, “dog”, “is”, “hairy”] and it has to correctly predict the word “hairy” in that position. This might be tricky to understand, but it helps the model adjust the embedding of “hairy” when it already “sees” the actual word in the input
(whereas in the first case, the 80% case, the token at that position is [MASK]); the sketch below puts these three cases together.
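Here is a minimal sketch of the 80/10/10 masking step, assuming PyTorch and a generic `tokenizer` object that exposes `mask_token_id` and `vocab_size` (the helper name `mask_tokens` and its exact signature are illustrative, not a specific library API):

```python
import torch

def mask_tokens(input_ids, tokenizer, mlm_probability=0.15):
    """Return (masked_inputs, labels) for masked language modeling."""
    labels = input_ids.clone()
    input_ids = input_ids.clone()

    # Sample 15% of positions as prediction targets.
    selected = torch.bernoulli(torch.full(labels.shape, mlm_probability)).bool()

    # Positions that were not selected are ignored by the loss
    # (-100 is the default ignore_index of cross-entropy in PyTorch).
    labels[~selected] = -100

    # 80% of the selected positions -> [MASK]
    replaced_with_mask = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & selected
    input_ids[replaced_with_mask] = tokenizer.mask_token_id

    # 10% of the selected positions -> a random token
    # (half of the remaining 20%, hence the 0.5 below).
    replaced_with_random = (
        torch.bernoulli(torch.full(labels.shape, 0.5)).bool()
        & selected
        & ~replaced_with_mask
    )
    random_tokens = torch.randint(tokenizer.vocab_size, labels.shape, dtype=input_ids.dtype)
    input_ids[replaced_with_random] = random_tokens[replaced_with_random]

    # The remaining 10% of the selected positions keep their original token,
    # but their labels are still set, so they still contribute to the loss.
    return input_ids, labels
```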
I hope that makes sense 
Cheers
So what this means is that the cross-entropy loss is calculated not just for the [MASK] tokens but for all 15% of the tokens sampled for masking, i.e. it includes the tokens replaced randomly and those left as is. Am I correct?
Yes @Ritu_Pande, you are correct
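As a small follow-up sketch (continuing the hypothetical `mask_tokens` helper above, and assuming a `model` call that returns raw logits): positions outside the 15% selection carry the label -100, which PyTorch's cross-entropy ignores by default, so the loss covers exactly the masked, randomly replaced, and kept-as-is tokens.

```python
import torch.nn.functional as F

# Hypothetical usage: `model` returns logits of shape (batch, seq_len, vocab_size).
masked_inputs, labels = mask_tokens(input_ids, tokenizer)
logits = model(masked_inputs)

# Flatten batch and sequence dimensions; label -100 is the default
# ignore_index, so only the 15% selected positions contribute to the loss.
loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
```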