Confusion regarding the video on BERT Objective

I don’t understand the purpose of the loss functions. Isn’t the data fed into the BERT model unlabeled for both next sentence prediction and masked word prediction? How do we use a loss function in that case?

Also, for the slide attached below, does this pre-training diagram visualize masked word prediction? What about next sentence prediction? How does next sentence prediction work? Are the embeddings for sentence A fed into the model followed by a [SEP] token, and what happens next to produce sentence B?

For question 1, labeled data is used to train the model, hence the loss function.

Hi @Anthony_Wu

The purpose of loss functions is to decide which weights to increase and which to decrease.
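To make that concrete, here is a tiny self-contained sketch (my own toy example, not the course code, assuming PyTorch): the gradient of the loss with respect to a weight tells the optimizer whether that weight should be nudged up or down.

```python
import torch

# Toy example: one weight, one training pair. The sign of d(loss)/d(w)
# determines whether w is increased or decreased by the update.
w = torch.tensor([0.5], requires_grad=True)       # a single model weight
x, target = torch.tensor([2.0]), torch.tensor([3.0])

loss = (w * x - target) ** 2                      # squared-error loss for this pair
loss.backward()                                   # computes d(loss)/d(w)

with torch.no_grad():
    w -= 0.1 * w.grad                             # step against the gradient
print(w)                                          # w moved in the direction that lowers the loss
```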

The input data are unlabeled before masking, for example:

b'Beginners BBQ Class Taking Place in Missoula!\nDo you want to get better at making delicious BBQ? You will have the opportunity, put this on your calendar now. Thursday, September 22nd join World Class BBQ Champion, Tony Balay from Lonestar Smoke Rangers. He will be teaching a beginner level class for everyone who wants to get better with their culinary skills.\nHe will teach you everything you need to know to compete in a KCBS BBQ competition, including techniques, recipes, timelines, meat selection and trimming, plus smoker and fire information.\nThe cost to be in the class is $35 per person, and for spectators it is free. Included in the cost will be either a t-shirt or apron and you will be tasting samples of each meat that is prepared.'

After masking:

Inputs: 

 <Z> BBQ Class Taking Place in Missoul <Y> Do you want to get better at making <X>? You will have the opportunity, put <W> your calendar now. Thursday, September 22 <V> World Class BBQ Champion, Tony Balay <U>onestar Smoke Rangers. He <T> teaching a beginner level class for everyone<S> to get better with their culinary skills.<R> teach you everything you need to know to <Q> a KCBS BBQ competition,<P>, recipes, timelines, meat selection <O>, plus smoker and fire information. The<N> be in the class is $35 per person <M> for spectators it is free. Include <L> the cost will be either a  <K>shirt or apron and you <J> tasting samples of each meat that is prepared <I>

Targets: 

 <Z> Beginners <Y>a! <X> delicious BBQ <W> this on <V>nd join <U> from L <T> will be<S> who wants<R> He will <Q> compete in<P> including techniques <O> and trimming<N> cost to <M>, and <L>d in <K>t- <J> will be <I>.
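To show how (input, target) pairs like the ones above are produced from unlabeled text, here is a deliberately simplified sketch (my own illustration, not the actual preprocessing code; it masks single words rather than spans):

```python
import random

def mask_words(words, mask_prob=0.15, seed=1):
    """Replace roughly 15% of the words with sentinel tokens; the removed words become the targets."""
    random.seed(seed)
    sentinels = iter("<Z> <Y> <X> <W> <V> <U> <T> <S> <R> <Q>".split())
    inputs, targets = [], []
    for word in words:
        if random.random() < mask_prob:
            s = next(sentinels)
            inputs.append(s)              # the input keeps only a placeholder
            targets.extend([s, word])     # the target records what was hidden there
        else:
            inputs.append(word)
    return " ".join(inputs), " ".join(targets)

text = "Do you want to get better at making delicious BBQ"
masked_input, target = mask_words(text.split())
print(masked_input)   # sentence with some words replaced by sentinels
print(target)         # the hidden words, keyed by their sentinels
```

The targets are created automatically from the raw text itself, which is why no human labeling is needed.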

Those inputs and targets are the masked word prediction part: the model should output high probabilities for these words (or tokens, to be more precise, since a token can be a sub-word, a whole word, or several words) at each sentinel position. If the probabilities for the corresponding tokens are high, the loss is low; if they are low, the loss is high.
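In code terms, this is typically a cross-entropy loss over the vocabulary at each target position. A minimal sketch (toy numbers of my own, assuming PyTorch):

```python
import torch
import torch.nn.functional as F

# Tiny vocabulary just for illustration.
vocab = {"<Z>": 0, "Beginners": 1, "BBQ": 2, "delicious": 3, "class": 4}

# Hypothetical model scores (logits) for one position whose correct token is "Beginners".
confident_right = torch.tensor([[0.1, 5.0, 0.2, 0.1, 0.3]])  # high score on "Beginners"
confident_wrong = torch.tensor([[0.1, 0.2, 5.0, 0.1, 0.3]])  # high score on "BBQ" instead
target = torch.tensor([vocab["Beginners"]])

print(F.cross_entropy(confident_right, target))  # small loss: correct token got high probability
print(F.cross_entropy(confident_wrong, target))  # large loss: correct token got low probability
```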


As for the slide you posted: it illustrates Next Sentence Prediction, not masked word prediction.

In “Next Sentence Prediction” the model has two inputs, sentence A and sentence B, and it has to guess whether sentence B is the sentence that actually follows sentence A. If the model predicts correctly, the loss is low; if it predicts incorrectly, the loss is high.
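A minimal sketch of that loss (again my own toy numbers, assuming PyTorch), treating next sentence prediction as a two-way classification on the A/B pair:

```python
import torch
import torch.nn.functional as F

# Label 1 = "B really follows A", label 0 = "B is a random sentence".
label_is_next = torch.tensor([1])

# Hypothetical 2-way scores (logits) the model produces for the pair "[CLS] A [SEP] B [SEP]".
predicts_correctly = torch.tensor([[0.2, 4.0]])    # leans toward "is next"  -> low loss
predicts_incorrectly = torch.tensor([[4.0, 0.2]])  # leans toward "not next" -> high loss

print(F.cross_entropy(predicts_correctly, label_is_next))
print(F.cross_entropy(predicts_incorrectly, label_is_next))
```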

Cheers
