Hi, I have two BERT-related conceptual questions.
-
During BERT's pre-training tasks, why do we need a Transformer encoder rather than a decoder? (A rough sketch of how I currently picture the difference is just below.)
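To make this concrete, here is a minimal toy sketch of my current understanding (my own illustration with dummy tensors, not BERT's actual code): an encoder lets every position attend to every other position, while a decoder applies a causal mask so a position can only see earlier ones.

```python
import torch

seq_len = 5
scores = torch.randn(seq_len, seq_len)  # dummy attention scores

# Encoder-style (bidirectional): no mask, softmax over the full row.
encoder_attn = torch.softmax(scores, dim=-1)

# Decoder-style (causal): upper triangle set to -inf before softmax,
# so position i only attends to positions <= i.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
decoder_attn = torch.softmax(scores.masked_fill(causal_mask, float("-inf")), dim=-1)

print(encoder_attn[0])  # first token attends to all positions
print(decoder_attn[0])  # first token attends only to itself
```

Is the bidirectional attention shown above the essential reason the encoder is required for pre-training?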
-
Comparing the masked language model (MLM) objective with next sentence prediction (NSP): why doesn't MLM need the [CLS] token? (See the sketch after this question for how I currently picture the two heads.)
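Here is a toy sketch of how I currently understand the two pre-training heads (dummy tensors and dimensions of my own choosing, not the real BERT implementation): MLM scores every token position against the vocabulary, so it seems to have no use for a pooled summary, while NSP makes one sentence-pair decision from the hidden state at the [CLS] position.

```python
import torch
import torch.nn as nn

batch, seq_len, hidden, vocab = 2, 8, 16, 100
hidden_states = torch.randn(batch, seq_len, hidden)  # pretend encoder output

# MLM head: per-position projection to vocabulary logits; the loss is
# computed only at the masked positions, not at [CLS].
mlm_head = nn.Linear(hidden, vocab)
mlm_logits = mlm_head(hidden_states)        # (batch, seq_len, vocab)

# NSP head: a binary classifier over the [CLS] (position 0) representation.
nsp_head = nn.Linear(hidden, 2)
nsp_logits = nsp_head(hidden_states[:, 0])  # (batch, 2)

print(mlm_logits.shape, nsp_logits.shape)
```

Is this per-token vs. pooled distinction the right way to think about why only NSP relies on [CLS]?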
Any relevant insights would be greatly appreciated. Thanks in advance!