Question about the sentinels

I am confused about the sentinels. When we run sentinels = get_sentinels(vocab_size, display=True), it takes the last items of the vocab and assigns the sentinels backward through the alphabet, starting with “Z”, then “Y”, etc., and the sentinels look like:

The sentinel is <Z> and the decoded token is: Internațional
The sentinel is <Y> and the decoded token is: erwachsene
The sentinel is <X> and the decoded token is: Cushion

However, when we then run the pretty_decode function with the sentinels as input, it actually substitutes different text for “Z” and “Y”: in this example a sentinel replaces “Class”, not “International”. It is not very clear how this happens. Why do we predefine the sentinels with the get_sentinels function? It seems that, for different inputs, the same sentinels “Y” and “Z” replace different words, depending on which words happen to be masked by random chance.

inputs:
Beginners BBQ Taking in Missoula! want to get better
making delicious ? You will have the opportunity, put this on
calendar now Thursday, September 22nd World Class BBQ Champion,
Tony Balay from Lonestar Smoke Rangers. He be a beginner


class for everyone wants to better their skills. He
will teach you you need to know compete in KCBS BBQ
competition, including techniques, recipes,s, meat selection
trimming, plus smoker information. The cost to be in the class is
$35 per person for spectator is free. Included in the cost will
be either a t-shirt or apron and you will tasting samples of each
meat that .

targets:
Class Place Do you at BBQ your . join
will teaching

level who get with culinary

Hi @PZ2004

I don’t quite understand your question. Please read the last paragraph of this post and tell me if you still have questions. In short, in T5:

Self-supervised training uses corrupted tokens, by randomly removing 15% of the tokens and replacing them with individual sentinel tokens (if several consecutive tokens are marked for removal, the whole group is replaced with a single sentinel token). The input of the encoder is the corrupted sentence, the input of the decoder is the original sentence and the target is then the dropped out tokens delimited by their sentinel tokens.

In our assignment, however, we do not implement the decoder part (or the other T5 tasks), so the assignment’s encoder has to predict the sentinels correctly.
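To make the masking step concrete, here is a minimal word-level sketch. It is not the assignment's code: `corrupt` is a hypothetical helper, it works on whitespace-separated words rather than on tokenizer ids, and the masked positions are passed in by hand instead of being drawn at random. The sentence is the opening of the BBQ example from the question.

```python
import string


def corrupt(words, masked_positions):
    """Replace masked words with sentinels.

    Consecutive masked words share a single sentinel. Sentinel names are
    handed out backward through the alphabet: <Z>, <Y>, <X>, ...
    """
    sentinel_names = (f"<{c}>" for c in reversed(string.ascii_uppercase))
    inputs, targets = [], []
    prev_masked = False
    for i, word in enumerate(words):
        if i in masked_positions:
            if not prev_masked:
                # Start of a new masked span: emit a fresh sentinel in both
                # the corrupted input and the target.
                sentinel = next(sentinel_names)
                inputs.append(sentinel)
                targets.append(sentinel)
            targets.append(word)  # the dropped word goes to the target only
            prev_masked = True
        else:
            inputs.append(word)   # unmasked words stay in the input
            prev_masked = False
    return " ".join(inputs), " ".join(targets)


words = "Beginners BBQ Class Taking Place in Missoula!".split()
# Hand-picked positions: "Class" alone, then "Place in" as one span.
inputs, targets = corrupt(words, masked_positions={2, 4, 5})
print(inputs)   # Beginners BBQ <Z> Taking <Y> Missoula!
print(targets)  # <Z> Class <Y> Place in
```

Note that “Place in” is two words but gets only one sentinel, which is the grouping described in the quote above.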

Cheers

P.S. Use “\” to escape the “<” symbols (that is what makes your text bold or struck out, and it is hard to read).

Sorry for the strange formatting. Here is the reformatted question:

I am confused about the sentinels. When we run sentinels = get_sentinels(vocab_size, display=True), it takes the last items of the vocab and assigns the sentinels backward through the alphabet, starting with “Z”, then “Y”, etc., and the sentinels look like:

However, when we then run the pretty_decode function with the sentinels as input, it actually substitutes different text for “Z”: in this example it replaces “Beginners”, not “International”. It is not very clear how this happens.

The first picture is just for illustration purposes. It goes with the following sentence: “I want to dress up as an Intellectual this halloween.”, where <V> is “Intellectual” and <b> is “halloween”.

There was also another example provided:

This time the masked spans are “for inviting” and “last”. The job of the model is to predict them correctly (assign high probability to “for inviting” for <X> and to “last” for <Y>).

The same goes for your longer example about the BBQ class, where the output illustrates which tokens the randomizer chose to mask: it chose <Z> for “Beginners” (which takes two tokens, [12847, 277]), <Y> for “a!” (which takes two tokens, [9, 55]), etc. That was done with the help of the prev_no_mask variable and some logic.
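Here is a rough sketch of how such logic could look at the token-id level. It is an assumption-laden illustration, not the assignment's actual code: it assumes the sentinel ids are simply the last ids of the vocabulary, handed out from `vocab_size - 1` downward, and it takes the mask decisions as an explicit list instead of calling the randomizer. `sentinel_names`, `mask_ids` and `pretty` are hypothetical helpers, `VOCAB_SIZE = 32000` is an assumed value, and the ids 100 and 101 are placeholders for unmasked tokens; only [12847, 277] and [9, 55] come from the example.

```python
import string

VOCAB_SIZE = 32000  # assumption: illustrative value only


def sentinel_names(vocab_size):
    """Fixed table: vocab_size-1 -> <Z>, vocab_size-2 -> <Y>, and so on."""
    return {vocab_size - i: f"<{c}>"
            for i, c in enumerate(reversed(string.ascii_letters), 1)}


def mask_ids(token_ids, mask_flags, vocab_size):
    """Replace each run of masked ids with the next sentinel id."""
    inps, targs = [], []
    cur_sentinel = vocab_size - 1
    prev_no_mask = True
    for tok, masked in zip(token_ids, mask_flags):
        if masked:
            if prev_no_mask:
                # First masked token of a run: emit a new sentinel id.
                inps.append(cur_sentinel)
                targs.append(cur_sentinel)
                cur_sentinel -= 1
            targs.append(tok)
            prev_no_mask = False
        else:
            inps.append(tok)
            prev_no_mask = True
    return inps, targs


def pretty(ids, names):
    """Show sentinel ids by name and leave every other id as a number."""
    return " ".join(names.get(i, str(i)) for i in ids)


names = sentinel_names(VOCAB_SIZE)
token_ids = [12847, 277, 100, 101, 9, 55]             # "Beginners", ..., "a!"
mask_flags = [True, True, False, False, True, True]   # hand-picked, not random
inps, targs = mask_ids(token_ids, mask_flags, VOCAB_SIZE)
print(pretty(inps, names))   # <Z> 100 101 <Y>
print(pretty(targs, names))  # <Z> 12847 277 <Y> 9 55
```

If the assignment works along these lines, the id-to-name table is built once and never needs updating: <Z> always stands for the same id, and only the text hidden behind it changes from one input to the next. The word the tokenizer would decode that id to on its own plays no role in the masking.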

Cheers

Thanks Arvyzukai. That matches my understanding of what the code is intended to do, but I wasn’t sure how this is achieved in the code.

Since pretty_decode takes “sentinels” as an input, the variable “sentinels” would need to be updated whenever a new text comes in and a new set of sentinel-value pairs is selected. It is not clear to me where and how this is done in the function tokenize_and_mask, which produces the inputs and targets with the sentinels incorporated. This function doesn’t output updated sentinels. This is where it gets very confusing to me.