BERT pretraining

blackdragon · February 4, 2024, 4:59am

Could someone explain the intuition behind replacing the tokens selected for masking with either [MASK], a random token, or the original token during pretraining? Why are all three used?

arvyzukai · February 6, 2024, 6:11pm

Hi @blackdragon

Here is my previous attempt at explaining this.
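
For reference, the replacement step being asked about can be sketched roughly as below. This is only a minimal sketch, assuming the 15% selection rate and the 80/10/10 split described in the original BERT paper; `mask_tokens`, `select_prob`, and the toy vocabulary are hypothetical names made up for illustration, not taken from the course code.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, select_prob=0.15, rng=None):
    """Return (inputs, labels) for masked language modelling.

    Assumed rates (from the BERT paper): each token is selected with
    probability `select_prob`; a selected token becomes [MASK] 80% of
    the time, a random vocabulary token 10% of the time, and stays
    unchanged 10% of the time.
    """
    rng = rng or random.Random()
    inputs = list(tokens)
    labels = [None] * len(tokens)  # None = position not predicted
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # token not selected, left untouched
        labels[i] = tok  # the model must predict the original token here
        r = rng.random()
        if r < 0.8:
            inputs[i] = MASK
        elif r < 0.9:
            inputs[i] = rng.choice(vocab)  # random replacement
        # otherwise keep the original token in the input
    return inputs, labels

sentence = "the cat sat on the mat".split()
toy_vocab = ["dog", "ran", "under", "a", "rug"]  # hypothetical vocabulary
inputs, labels = mask_tokens(sentence, toy_vocab, rng=random.Random(0))
print(inputs)
print(labels)
```

The short version of the intuition, as given in the BERT paper: [MASK] never appears at fine-tuning time, so mixing in random and unchanged tokens reduces that mismatch, and because the model cannot tell which positions were selected, it has to keep a useful representation of every input token.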

Cheers
