Understanding Sentinels in C4W3_Assignment

/notebooks/C4W3_Assignment.ipynb

From what I understand, sentinels are used in the decoder of the T5 as targets. They are like placeholders.

Let's assume the vocab size is 32,000.

Let's assume “I” = 2, “love” = 3, “learning” = 4, and “and” = 5.

Example sentence: “I love machine learning and deep learning”

After random selection, the words “machine” and “deep” are selected for masking, and the input to the encoder will be as follows:

“2 3 31999 4 5 31998 4”

The input to the decoder will be:

“31999 machine 31998 deep”
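To make the example concrete, here is a minimal sketch of that masking step in Python. It is not the notebook's code: `mask_words`, `toy_ids`, and the word-level "tokenization" are hypothetical, and it assumes sentinels are handed out counting down from `vocab_size - 1`, which is what <Z> sitting at index 31999 suggests. It also masks single words only; real span corruption would merge adjacent masked tokens under one sentinel.

```python
VOCAB_SIZE = 32_000  # assumed, as in the question


def mask_words(words, masked_positions, word_to_id, vocab_size=VOCAB_SIZE):
    """Replace each masked word with a sentinel ID, counting down from vocab_size - 1.

    Returns (encoder_inputs, decoder_targets). Words without a known ID are
    left as plain strings so the result stays readable.
    """
    inputs, targets = [], []
    sentinel = vocab_size - 1
    for pos, word in enumerate(words):
        token = word_to_id.get(word, word)
        if pos in masked_positions:
            inputs.append(sentinel)        # placeholder goes into the encoder input
            targets += [sentinel, token]   # same placeholder, then the hidden word
            sentinel -= 1                  # next masked word gets the next ID down
        else:
            inputs.append(token)
    return inputs, targets


# Toy IDs taken from the question; "machine" and "deep" have no ID there.
toy_ids = {"I": 2, "love": 3, "learning": 4, "and": 5}
words = "I love machine learning and deep learning".split()
inputs, targets = mask_words(words, {2, 5}, toy_ids)
print(inputs)
print(targets)
```

The same sentinel ID appears in both sequences, which is how the decoder's output can be lined up with the gaps in the encoder's input.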

But the vocab already has valid entries at positions 31998 and 31999, as we have already seen in the get_sentinels() method, which prints out:

The sentinel is <Z> and the decoded token is: Internațional
The sentinel is <Y> and the decoded token is: erwachsene
The sentinel is <X> and the decoded token is: Cushion

…which means index 31999 in the vocab is associated with the word “Internațional”.

Why are we associating a valid vocab index with a sentinel? How does this work?

Hi @chartechaccountant

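The overlap comes from how the notebook picks its sentinel IDs. In the original T5 vocabulary, the sentinel tokens (<extra_id_0>, <extra_id_1>, and so on) are 100 extra tokens appended after the main 32,000-piece SentencePiece vocabulary, so they occupy IDs 32,000–32,099 and never collide with a real piece. Judging from the get_sentinels() output, the tokenizer in this assignment exposes only the 32,000 regular pieces, so the notebook borrows IDs from the top of that range instead, counting down from 31999. If you detokenize those IDs directly, you get whatever piece SentencePiece stores there (typically a rare one), which is why you see “Internațional” for <Z>. The model itself never sees the text, only the integer ID. In the training data that ID shows up almost exclusively as a placeholder, so the model learns to treat it as one, and the notebook's display helpers print it as <Z> rather than as the underlying word.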
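A small sketch may help show why the same ID can print as either a word or a sentinel. This is an illustration under assumptions, not the assignment's implementation: `build_sentinels`, `pretty_decode`, and `toy_vocab` are hypothetical names, the vocabulary is a hand-written dictionary rather than a real SentencePiece model, and the letter ordering is inferred from the <Z>, <Y>, <X> output quoted in the question.

```python
import string

VOCAB_SIZE = 32_000  # assumed, as in the question


def build_sentinels(vocab_size=VOCAB_SIZE):
    """Map the top IDs to <Z>, <Y>, <X>, ... counting down from vocab_size - 1."""
    letters = reversed(string.ascii_letters)  # 'Z', 'Y', ..., 'A', 'z', ..., 'a'
    return {vocab_size - i: f"<{ch}>" for i, ch in enumerate(letters, 1)}


def pretty_decode(ids, id_to_piece, sentinels):
    """Decode IDs to text, showing sentinel IDs by name instead of their raw piece."""
    return " ".join(sentinels.get(i, id_to_piece.get(i, "<unk>")) for i in ids)


# Hand-written stand-in for the tokenizer's vocabulary (hypothetical entries).
toy_vocab = {2: "I", 3: "love", 4: "learning", 5: "and",
             31998: "erwachsene", 31999: "Internațional"}

sentinels = build_sentinels()
ids = [2, 3, 31999, 4]

raw = " ".join(toy_vocab[i] for i in ids)          # what a plain detokenize shows
pretty = pretty_decode(ids, toy_vocab, sentinels)  # what a sentinel-aware display shows
print(raw)
print(pretty)
```

Nothing about the ID changes between the two printouts; only the lookup table used for display differs, and the model works with the ID alone.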

Hope it helps! Feel free to ask if you need further assistance.

Understood! Thank you. This also implies that the real “International” will have some other vocab index.

Thank you!

You’re welcome! Happy to help :raised_hands: