Understanding Sentinels in C4W3_Assignment

/notebooks/C4W3_Assignment.ipynb

From what I understand, sentinels are used as targets in the T5 decoder; they act as placeholders.

Let’s assume the vocab size is 32,000.

Let’s assume the token IDs are I=2, love=3, learning=4, and=5.

Example sentence: “I love machine learning and deep learning”

After random selection, the words “machine” and “deep” are chosen for masking, and the input to the encoder will be:

“2 3 31998 4 5 31997 4”

The target for the decoder will be:

“31998 machine 31997 deep”
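This masking step can be sketched in plain Python. Note that the IDs assigned to “machine” and “deep” (6 and 7 here) are my own assumption, since the post only gives IDs for the unmasked words; everything else follows the example above:

```python
# Minimal sketch of T5-style span corruption with a toy vocabulary.
# The real assignment uses SentencePiece IDs; the IDs for "machine"
# and "deep" below are assumed for illustration.

VOCAB_SIZE = 32_000
vocab = {"I": 2, "love": 3, "learning": 4, "and": 5,
         "machine": 6, "deep": 7}  # 6 and 7 are assumed

def mask_spans(tokens, masked_words):
    """Replace each masked word with the next sentinel ID (counting
    down from the top of the vocab) and collect the decoder targets."""
    encoder_input, decoder_target = [], []
    next_sentinel = VOCAB_SIZE - 2          # 31998, as in the example
    for tok in tokens:
        if tok in masked_words:
            encoder_input.append(next_sentinel)
            decoder_target.extend([next_sentinel, vocab[tok]])
            next_sentinel -= 1              # 31997 for the next span
        else:
            encoder_input.append(vocab[tok])
    return encoder_input, decoder_target

sentence = "I love machine learning and deep learning".split()
enc, dec = mask_spans(sentence, {"machine", "deep"})
print(enc)  # [2, 3, 31998, 4, 5, 31997, 4]
print(dec)  # [31998, 6, 31997, 7]
```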

But the vocab already has valid entries at indices like 31999, 31998, and 31997, as we have already seen in the get_sentinels() method, which prints:

The sentinel is <Z> and the decoded token is: Internațional
The sentinel is <Y> and the decoded token is: erwachsene
The sentinel is <X> and the decoded token is: Cushion

…which means index 31999 in the vocab is associated with the word “Internațional”.

Why are we associating a valid vocab index with a sentinel? How does this work out?

Hi @chartechaccountant

The apparent overlap is just an implementation detail. In the original T5 setup, the sentinels (<extra_id_0>, <extra_id_1>, etc.) are dedicated special tokens appended after the main 32,000-token SentencePiece vocabulary, so they get their own IDs. The assignment takes a shortcut and instead reuses the IDs at the very top of the existing vocab (31999, 31998, 31997, …) as sentinels. If you decode those IDs directly through SentencePiece, they still show their original subwords, which is why get_sentinels() prints strings like “Internațional”. But those subwords are rare enough in the training data that repurposing their IDs is essentially harmless: during pretraining, the model treats such an ID as a placeholder, not as a real word.
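A toy illustration of this ID reuse, assuming the sentinel IDs are simply counted down from the top of the vocab (the subwords shown are taken from the printout above; this is a sketch of the idea, not the assignment’s exact implementation):

```python
# Toy illustration: the top-of-vocab IDs double as sentinels.
# The subwords stored at those IDs come from the get_sentinels()
# printout above; the vocab size matches the thread's example.

VOCAB_SIZE = 32_000
top_tokens = {31999: "Internațional", 31998: "erwachsene", 31997: "Cushion"}

def get_sentinels(num_sentinels=3):
    """Map each borrowed top-of-vocab ID to a sentinel name,
    counting down from VOCAB_SIZE - 1."""
    names = ["<Z>", "<Y>", "<X>"]
    return {VOCAB_SIZE - 1 - i: names[i] for i in range(num_sentinels)}

for idx, name in get_sentinels().items():
    # Same ID, two readings: SentencePiece still decodes it to a rare
    # subword, but the pretraining pipeline treats it as a sentinel.
    print(f"The sentinel is {name} and the decoded token is: {top_tokens[idx]}")
```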

Hope it helps! Feel free to ask if you need further assistance.


Understood! Thank you. So the subword that originally lived at that index effectively gives up its slot while the ID is being used as a sentinel.

Thank you!


You’re welcome! Happy to help :raised_hands:
