[Week 4] BERT Pre-Training Concepts

Hi, I have two BERT-related conceptual questions.

  1. During BERT pre-training, why do we need a transformer encoder rather than a decoder?

  2. Masked language modeling (MLM) vs. next sentence prediction (NSP): why does MLM not need [CLS]?

Any relevant insights would be greatly helpful. Thanks in advance!

  1. A decoder is required for generation tasks like English-to-French translation, where tokens are produced left to right. The encoder performs bidirectional self-attention over the input sentence, so every position can attend to context on both sides of it; a decoder's causal masking would hide exactly the tokens that come after the [MASK]. To predict the masked word, it's sufficient to take the masked position's entry from the final encoder representation and feed it through a softmax over the vocabulary.
  2. [CLS] is a special classification token prepended to every input; its final hidden state serves as an aggregate representation of the whole sequence. NSP needs a single 0/1 output saying whether the second sentence really follows the first, so that prediction is made from the [CLS] representation. MLM instead predicts each masked token from that token's own final hidden state, so it has no use for [CLS] (the token is still present in the input during pre-training, MLM just doesn't read from it). A small sketch of both heads follows below.
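
Here is a minimal PyTorch sketch of the idea, not the actual BERT implementation: the dimensions, the `masked_positions` indices, and the plain `nn.Linear` heads are all toy assumptions. The MLM head reads the hidden states at the masked positions and projects them onto the vocabulary, while the NSP head reads only the hidden state at position 0 (the [CLS] token) and projects it onto two classes.

```python
import torch
import torch.nn as nn

# Toy dimensions (assumed for illustration only).
vocab_size, hidden_size, seq_len, batch = 1000, 64, 8, 2

# Stand-in for the transformer encoder's output: one vector per input token.
hidden_states = torch.randn(batch, seq_len, hidden_size)

# MLM head: project each masked token's hidden state onto the vocabulary.
mlm_head = nn.Linear(hidden_size, vocab_size)
masked_positions = torch.tensor([[3], [5]])  # hypothetical [MASK] indices per example
masked_hidden = hidden_states[torch.arange(batch).unsqueeze(1), masked_positions]
mlm_logits = mlm_head(masked_hidden)         # (batch, n_masked, vocab_size)
mlm_probs = mlm_logits.softmax(dim=-1)       # distribution over the vocabulary

# NSP head: binary classifier on the [CLS] token, which sits at position 0.
nsp_head = nn.Linear(hidden_size, 2)
cls_hidden = hidden_states[:, 0]             # (batch, hidden_size)
nsp_logits = nsp_head(cls_hidden)            # is-next vs. not-next
```

Note that MLM never touches `hidden_states[:, 0]`, which is why it doesn't depend on [CLS], while NSP uses nothing else.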