Why do we use randomizer() < 0.15?


We are masking 15% of the words. Why do we use a random number less than 0.15 to mask them?

Hi @Amazing_Patrick

It’s a common, simple approach: sample from a uniform distribution on [0, 1) and check the sample against a threshold. On average, out of 1000 uniform samples, about 150 would be less than 0.15, so each token ends up masked with probability 15%.
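For example, here is a minimal sketch in Python (the sentence and the `[MASK]` string are just illustrative; `random.random()` draws from Uniform[0, 1)):

```python
import random

random.seed(0)  # only to make this example reproducible

tokens = "the quick brown fox jumps over the lazy dog".split()

# Each token is masked independently: random.random() returns a
# uniform sample from [0, 1), which falls below 0.15 on average
# 15% of the time.
masked = ["[MASK]" if random.random() < 0.15 else tok for tok in tokens]
print(masked)
```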

The choice of 15% is addressed in the paper (BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding); in short, it gave better accuracy. (See Appendix C at the bottom of the paper.)

Cheers


I still don’t quite understand the logic here. I input a random sentence that has 44 tokens after tokenization. With a noise rate of 0.15, I expected about 6 masked tokens. But with the logic in the code, I can end up with 10 masked tokens, leaving 34 unmasked tokens in the input.

On average, the number of masked tokens would be 0.15 × 44 = 6.6: if you ran a lot of trials, you would get about 6.6 masked tokens per trial. On a single trial you can get anywhere from 0 to all 44 tokens masked (though 44 is practically impossible: $0.15^{44} \approx 5.598 \times 10^{-37}$).

You can play with the code and count the numbers yourself to check that the results are what they are supposed to be.
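For instance, a minimal simulation sketch in Python (n = 44 tokens, p = 0.15; the trial count is an arbitrary choice for illustration):

```python
import random
from math import comb

n, p, trials = 44, 0.15, 100_000
random.seed(0)

# One trial = mask each of n tokens independently with probability p,
# then count how many got masked.
counts = [sum(random.random() < p for _ in range(n)) for _ in range(trials)]

print(sum(counts) / trials)                    # empirical mean, ~6.6 = n * p
print(sum(c >= 10 for c in counts) / trials)   # empirical P(10 or more masks)

# Exact binomial tail: P(X >= 10) = sum over k of C(n, k) p^k (1-p)^(n-k),
# which comes out to about 0.11 -- so 10+ masked tokens is not unusual.
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(10, n + 1))
print(exact)
```

The spread around 6.6 is what makes 10 masked tokens unsurprising: the standard deviation here is $\sqrt{n \cdot p \cdot (1 - p)} \approx 2.37$, so 10 is only about 1.4 standard deviations above the mean.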

You can also play with online binomial distribution calculators like this or this (input n = 44, p = 0.15, x = 11) or this to get a better intuition.