Why do we use randomizer() < 0.15?

We are masking 15% of the words. Why do we use a random number less than 0.15 to mask them?

Hi @Amazing_Patrick

It’s a common, simple approach: sample from a uniform distribution and check the sample against a threshold. On average, out of 1000 uniform samples on [0, 1), about 150 will be less than 0.15, so each token is masked with probability 0.15.
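A minimal sketch of that idea (the token list and `[MASK]` string here are just illustrative; the course code will differ in details):

```python
import random

random.seed(0)  # for reproducibility of this demo

# Hypothetical input: 1000 tokens.
tokens = ["the", "quick", "brown", "fox"] * 250

# Mask each token independently with probability 0.15:
# a uniform sample on [0, 1) is below 0.15 exactly 15% of the time.
masked = ["[MASK]" if random.random() < 0.15 else t for t in tokens]

n_masked = sum(1 for t in masked if t == "[MASK]")
print(n_masked)  # close to 150 on average, but varies run to run
```

Note that the count is random: it is only *on average* 15% of the tokens, which is exactly the point discussed below.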

The reason for choosing 15% is addressed in the paper (BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding) - in short, better accuracy. (See Appendix C at the end of the paper.)



I still don’t quite understand the logic here. I input a random sentence with 44 tokens after tokenization. With a noise rate of 0.15, the number of masked tokens should be less than about 6. But with the logic in the code, I can end up with 10 masked tokens, leaving only 34 unmasked tokens in the input.

On average the number of masked tokens would be 6.6 (44 × 0.15). If you ran a lot of trials, you would get about 6.6 masked tokens per trial on average. On a single trial you can get 0, or even all 44, masked tokens (though the latter is practically impossible: the probability is about 5.598 \times 10^{-37}).
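The count of masked tokens follows a binomial distribution with n = 44 and p = 0.15. A quick sketch with only the standard library shows the mean, the probability of seeing exactly 10 masked tokens (not rare at all), and the probability of masking all 44:

```python
import math

n, p = 44, 0.15

# Expected number of masked tokens: n * p
mean = n * p  # 6.6

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of exactly 10 masked tokens out of 44
p10 = binom_pmf(10, n, p)  # about 0.057, i.e. roughly 1 run in 18

# Probability that all 44 tokens are masked
p_all = p**44  # about 5.6e-37

print(mean, p10, p_all)
```

So getting 10 masked tokens out of 44 is perfectly normal for this scheme; only extreme counts are vanishingly unlikely.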

You can play with the code and count the numbers manually to check whether the results are what they are supposed to be.

You can also play with online binomial-distribution calculators like this or this (input n = 44, p = 0.15, x = 11) or this to get a better intuition.