[Week 4] BERT Pre-Training Concepts

Hi, I have two BERT-related conceptual questions.

  1. During BERT pre-training, why do we need a transformer encoder rather than a decoder?

  2. Masked language modeling (MLM) vs. next sentence prediction (NSP): why does MLM not need [CLS]?

Any relevant insights would be greatly helpful. Thanks in advance!

  1. A decoder is required for generation tasks like English-to-French translation, where tokens are produced left to right. The encoder performs bidirectional self-attention over the input sentence, so every position can attend to context on both sides of it; a decoder's causal masking would hide exactly the tokens that come after the [MASK]. To predict the masked word, it's sufficient to take the masked position's entry from the final encoder representation and feed it through a softmax over the vocabulary.
  2. [CLS] is a special classification token prepended to every input; its final hidden state serves as an aggregate representation of the whole sequence. NSP needs a single 0/1 output saying whether the second sentence really follows the first, so that prediction is made from the [CLS] representation. MLM instead predicts each masked token from that token's own final hidden state, so it has no use for [CLS] (the token is still present in the input during pre-training, MLM just doesn't read from it). A small sketch of both heads follows below.
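
Here is a minimal PyTorch sketch of the idea, not the actual BERT implementation: the dimensions, the `masked_positions` indices, and the plain `nn.Linear` heads are all toy assumptions. The MLM head reads the hidden states at the masked positions and projects them onto the vocabulary, while the NSP head reads only the hidden state at position 0 (the [CLS] token) and projects it onto two classes.

```python
import torch
import torch.nn as nn

# Toy dimensions (assumed for illustration only).
vocab_size, hidden_size, seq_len, batch = 1000, 64, 8, 2

# Stand-in for the transformer encoder's output: one vector per input token.
hidden_states = torch.randn(batch, seq_len, hidden_size)

# MLM head: project each masked token's hidden state onto the vocabulary.
mlm_head = nn.Linear(hidden_size, vocab_size)
masked_positions = torch.tensor([[3], [5]])  # hypothetical [MASK] indices per example
masked_hidden = hidden_states[torch.arange(batch).unsqueeze(1), masked_positions]
mlm_logits = mlm_head(masked_hidden)         # (batch, n_masked, vocab_size)
mlm_probs = mlm_logits.softmax(dim=-1)       # distribution over the vocabulary

# NSP head: binary classifier on the [CLS] token, which sits at position 0.
nsp_head = nn.Linear(hidden_size, 2)
cls_hidden = hidden_states[:, 0]             # (batch, hidden_size)
nsp_logits = nsp_head(cls_hidden)            # is-next vs. not-next
```

Note that MLM never touches `hidden_states[:, 0]`, which is why it doesn't depend on [CLS], while NSP uses nothing else.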