Confusion regarding the video on BERT Objective

I don’t understand the purpose of the loss functions. Isn’t the data fed into the BERT model unlabeled for both next sentence prediction and masked word prediction? How do we use a loss function in that case?

Also, for the slide attached below, does this pre-training diagram visualize masked word prediction? What about next sentence prediction? How does next sentence prediction work? Are the embeddings for sentence A fed into the model followed by a [SEP] token, and what happens next to produce sentence B?

For question 1, labeled data is used to train the model, hence the loss function.

Hi @Anthony_Wu

The purpose of loss functions is to decide which weights to increase and which to decrease.
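To make that concrete, here is a tiny self-contained sketch (my own toy example, not the course code, assuming PyTorch): the gradient of the loss with respect to a weight tells the optimizer whether that weight should be nudged up or down.

```python
import torch

# Toy example: one weight, one training pair. The sign of d(loss)/d(w)
# determines whether w is increased or decreased by the update.
w = torch.tensor([0.5], requires_grad=True)       # a single model weight
x, target = torch.tensor([2.0]), torch.tensor([3.0])

loss = (w * x - target) ** 2                      # squared-error loss for this pair
loss.backward()                                   # computes d(loss)/d(w)

with torch.no_grad():
    w -= 0.1 * w.grad                             # step against the gradient
print(w)                                          # w moved in the direction that lowers the loss
```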

The input data are unlabeled before masking, for example:

b'Beginners BBQ Class Taking Place in Missoula!\nDo you want to get better at making delicious BBQ? You will have the opportunity, put this on your calendar now. Thursday, September 22nd join World Class BBQ Champion, Tony Balay from Lonestar Smoke Rangers. He will be teaching a beginner level class for everyone who wants to get better with their culinary skills.\nHe will teach you everything you need to know to compete in a KCBS BBQ competition, including techniques, recipes, timelines, meat selection and trimming, plus smoker and fire information.\nThe cost to be in the class is $35 per person, and for spectators it is free. Included in the cost will be either a t-shirt or apron and you will be tasting samples of each meat that is prepared.'

After masking:

Inputs: 

 <Z> BBQ Class Taking Place in Missoul <Y> Do you want to get better at making <X>? You will have the opportunity, put <W> your calendar now. Thursday, September 22 <V> World Class BBQ Champion, Tony Balay <U>onestar Smoke Rangers. He <T> teaching a beginner level class for everyone<S> to get better with their culinary skills.<R> teach you everything you need to know to <Q> a KCBS BBQ competition,<P>, recipes, timelines, meat selection <O>, plus smoker and fire information. The<N> be in the class is $35 per person <M> for spectators it is free. Include <L> the cost will be either a  <K>shirt or apron and you <J> tasting samples of each meat that is prepared <I>

Targets: 

 <Z> Beginners <Y>a! <X> delicious BBQ <W> this on <V>nd join <U> from L <T> will be<S> who wants<R> He will <Q> compete in<P> including techniques <O> and trimming<N> cost to <M>, and <L>d in <K>t- <J> will be <I>.
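To show how (input, target) pairs like the ones above are produced from unlabeled text, here is a deliberately simplified sketch (my own illustration, not the actual preprocessing code; it masks single words rather than spans):

```python
import random

def mask_words(words, mask_prob=0.15, seed=1):
    """Replace roughly 15% of the words with sentinel tokens; the removed words become the targets."""
    random.seed(seed)
    sentinels = iter("<Z> <Y> <X> <W> <V> <U> <T> <S> <R> <Q>".split())
    inputs, targets = [], []
    for word in words:
        if random.random() < mask_prob:
            s = next(sentinels)
            inputs.append(s)              # the input keeps only a placeholder
            targets.extend([s, word])     # the target records what was hidden there
        else:
            inputs.append(word)
    return " ".join(inputs), " ".join(targets)

text = "Do you want to get better at making delicious BBQ"
masked_input, target = mask_words(text.split())
print(masked_input)   # sentence with some words replaced by sentinels
print(target)         # the hidden words, keyed by their sentinels
```

The targets are created automatically from the raw text itself, which is why no human labeling is needed.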

Those inputs and targets are the masked word prediction part: the model should output high probabilities for these words (or tokens, to be more precise, since a token can be a sub-word, a whole word, or several words) at each sentinel position. If the probabilities for the corresponding tokens are high, the loss is low; if they are low, the loss is high.
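In code terms, this is typically a cross-entropy loss over the vocabulary at each target position. A minimal sketch (toy numbers of my own, assuming PyTorch):

```python
import torch
import torch.nn.functional as F

# Tiny vocabulary just for illustration.
vocab = {"<Z>": 0, "Beginners": 1, "BBQ": 2, "delicious": 3, "class": 4}

# Hypothetical model scores (logits) for one position whose correct token is "Beginners".
confident_right = torch.tensor([[0.1, 5.0, 0.2, 0.1, 0.3]])  # high score on "Beginners"
confident_wrong = torch.tensor([[0.1, 0.2, 5.0, 0.1, 0.3]])  # high score on "BBQ" instead
target = torch.tensor([vocab["Beginners"]])

print(F.cross_entropy(confident_right, target))  # small loss: correct token got high probability
print(F.cross_entropy(confident_wrong, target))  # large loss: correct token got low probability
```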


As for the slide you posted: it illustrates Next Sentence Prediction, not masked word prediction.

In “Next Sentence Prediction” the model has two inputs, sentence A and sentence B, and it has to guess whether sentence B is the sentence that actually follows sentence A. If the model predicts correctly, the loss is low; if it predicts incorrectly, the loss is high.
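A minimal sketch of that loss (again my own toy numbers, assuming PyTorch), treating next sentence prediction as a two-way classification on the A/B pair:

```python
import torch
import torch.nn.functional as F

# Label 1 = "B really follows A", label 0 = "B is a random sentence".
label_is_next = torch.tensor([1])

# Hypothetical 2-way scores (logits) the model produces for the pair "[CLS] A [SEP] B [SEP]".
predicts_correctly = torch.tensor([[0.2, 4.0]])    # leans toward "is next"  -> low loss
predicts_incorrectly = torch.tensor([[4.0, 0.2]])  # leans toward "not next" -> high loss

print(F.cross_entropy(predicts_correctly, label_is_next))
print(F.cross_entropy(predicts_incorrectly, label_is_next))
```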

Cheers
