Why does this code show up in the C4W3 graded function?

Hello,
this is about the first exercise.
The sentence used is:
input_str = 'Beginners BBQ Class Taking Place in Missoula!\nDo you want to get better at making delicious BBQ? You will have the opportunity, put this on your calendar now. Thursday, September 22nd join World Class BBQ Champion, Tony Balay from Lonestar Smoke Rangers.'
The tokenized output is:

tokenized inputs - shape=53:

[31999 15068  4501     3 12297  3399    16  5964  7115 31998   531    25
   241    12   129   394    44   492 31997    58   148    56    43     8
  1004     6   474 31996    39  4793   230     5  2721     6  1600  1630
 31995  1150  4501 15068 16127     6  9137  2659  5595 31994   782  3624
 14627    15 12612   277     5]
It is not clear to me why 31999 is in these input codes. 31999 is the code for the word “international”, but the word “international” is not in the sentence. I double-checked and detokenized 31999 using the tokenizer.
Please advise.

It seems these values 31999, 31998, …, 31994 indicate where a word got masked in the input, and in the targets they indicate the start of each masked span. Am I getting this correctly?

I think that’s right!
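To make that concrete, here is a minimal sketch of T5-style span corruption under two assumptions: a 32,000-token vocabulary whose top ids (31999, 31998, …) are reused as sentinels, and single-token masked spans. The function name span_mask and the toy ids are hypothetical, not the assignment’s code:

def span_mask(token_ids, mask_positions, vocab_size=32000):
    """Replace each masked token with a decreasing sentinel id and
    collect the masked tokens (prefixed by the same sentinel) as targets."""
    inputs, targets = [], []
    sentinel = vocab_size - 1          # first sentinel: 31999
    for i, tok in enumerate(token_ids):
        if i in mask_positions:
            inputs.append(sentinel)    # sentinel stands in for the span
            targets.extend([sentinel, tok])
            sentinel -= 1              # next masked span gets 31998, ...
        else:
            inputs.append(tok)
    return inputs, targets

# Example: mask positions 0 and 3 of a toy sequence.
ids = [101, 15068, 4501, 202, 3399]
inp, tgt = span_mask(ids, {0, 3})
print(inp)  # [31999, 15068, 4501, 31998, 3399]
print(tgt)  # [31999, 101, 31998, 202]

In the assignment’s actual output the targets additionally end with the EOS id 1, as in the shape-19 target array shown further down.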

The appearance of the token 31999 (which you mentioned corresponds to the word “international”) in the tokenized output is indeed puzzling if the word “international” does not appear in the input text. Here are a few possible explanations for this issue:

1. Special Tokens or Reserved IDs:

  • Sometimes, tokenizers use special tokens for things like padding, unknown words, or beginning/end of a sentence. The number 31999 might be a special token that has been incorrectly mapped to “international” during detokenization.

2. Encoding or Decoding Issue:

  • There might be an issue with the encoding or decoding process where a special token got misinterpreted as 31999. This could happen if the tokenizer accidentally maps a special sequence or character to this index.

3. Tokenization Error:

  • It’s possible there was an error in the tokenization process itself, which led to the wrong token being generated. This might be due to an issue with the model or the tokenizer library.

4. Tokenizer Configuration:

  • Check if the tokenizer is configured to include special tokens or has a specific vocabulary mapping. Sometimes, tokenizers are pre-configured to handle specific scenarios like adding special tokens, which might be the reason for this anomaly.

Troubleshooting Steps:

  1. Check the Tokenizer Configuration:
  • Verify whether any special tokens or configuration options could lead to the inclusion of 31999 in the tokenized sequence.
  2. Manual Detokenization:
  • Manually detokenize the sequence to check whether 31999 indeed corresponds to “international” in this context, or whether it is being mapped to something else (see the sketch after this list).
  3. Different Tokenizer:
  • Try a different tokenizer or a different model to see if the issue persists.
  4. Inspect the Input Text:
  • Ensure there are no hidden or encoded characters in the input text that might be triggering this behavior.
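For step 2, here is a hedged sketch of manual detokenization with the sentencepiece library; the model filename is an assumption, so point it at whatever .model file your notebook actually loads:

import sentencepiece as spm

# Load the assignment's SentencePiece model (filename is an assumption).
sp = spm.SentencePieceProcessor(model_file='sentencepiece.model')

print(sp.get_piece_size())    # expect 32000 for this vocabulary
print(sp.id_to_piece(31999))  # the raw piece stored at id 31999
print(sp.decode([31999]))     # what detokenizing 31999 produces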

thank you.

This is to confirm that the 3199x ids are used as sentinels.
Please compare the ids with the text: 31999 is <Z>, 31998 is <Y>, and so on.
Not sure why the designer of the program came up with such a complicated scheme rather than generating independent sentinel tokens.
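The scheme is less arbitrary than it looks: the i-th masked span simply reuses id vocab_size - 1 - i, and the letters <Z>, <Y>, … are only a display convention. A hypothetical reconstruction of that mapping (sentinel_id is my name, not the assignment’s):

import string

vocab_size = 32000
# Z..A then z..a, matching the sentinel list further down (52 sentinels).
letters = string.ascii_uppercase[::-1] + string.ascii_lowercase[::-1]
sentinel_id = {f'<{c}>': vocab_size - 1 - i for i, c in enumerate(letters)}

print(sentinel_id['<Z>'])  # 31999
print(sentinel_id['<Y>'])  # 31998
print(sentinel_id['<U>'])  # 31994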

inputs - shape=53:
[31999 15068 4501 3 12297 3399 16 5964 7115 31998 531 25
241 12 129 394 44 492 31997 58 148 56 43 8
1004 6 474 31996 39 4793 230 5 2721 6 1600 1630
31995 1150 4501 15068 16127 6 9137 2659 5595 31994 782 3624
14627 15 12612 277 5]

targets - shape=19:

[31999 12847 277 31998 9 55 31997 3326 15068 31996 48 30
31995 727 1715 31994 45 301 1]


Inputs: 

 b'<Z> BBQ Class Taking Place in Missoul <Y> Do you want to get better at making <X>? You will have the opportunity, put <W> your calendar now. Thursday, September 22 <V> World Class BBQ Champion, Tony Balay <U>onestar Smoke Rangers.'

Targets: 

 b'<Z> Beginners <Y>a! <X> delicious BBQ <W> this on <V>nd join <U> from L'
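Here is a sketch of how such readable strings can be produced: decode normally, but render each sentinel id as its display form instead of the vocabulary word it shadows. pretty_decode is a hypothetical helper, and sp / sentinel_id come from the sketches above:

def pretty_decode(ids, sentinel_id, sp):
    """Decode ids, rendering each sentinel id as <Z>, <Y>, ... instead of
    the vocabulary word it shadows."""
    id_to_sentinel = {v: k for k, v in sentinel_id.items()}
    out, chunk = [], []
    for tok in ids:
        if tok in id_to_sentinel:
            if chunk:                    # flush ordinary tokens first
                out.append(sp.decode(chunk))
                chunk = []
            out.append(id_to_sentinel[tok])
        else:
            chunk.append(tok)
    if chunk:
        out.append(sp.decode(chunk))
    return ' '.join(out)

# Applied to the first target ids above, this should yield something
# like '<Z> Beginners <Y> ...'.
print(pretty_decode([31999, 12847, 277, 31998, 9, 55], sentinel_id, sp))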

Sentinel list:
The sentinel is <Z> and the decoded token is: Internațional # this is what 31999 decoded to before it was repurposed as a sentinel (the “International” mentioned above)
The sentinel is <Y> and the decoded token is: erwachsene
The sentinel is <X> and the decoded token is: Cushion
The sentinel is <W> and the decoded token is: imunitar
The sentinel is <V> and the decoded token is: Intellectual
The sentinel is <U> and the decoded token is: traditi
The sentinel is <T> and the decoded token is: disguise
The sentinel is <S> and the decoded token is: exerce
The sentinel is <R> and the decoded token is: nourishe
The sentinel is <Q> and the decoded token is: predominant
The sentinel is <P> and the decoded token is: amitié
The sentinel is <O> and the decoded token is: erkennt
The sentinel is <N> and the decoded token is: dimension
The sentinel is <M> and the decoded token is: inférieur
The sentinel is <L> and the decoded token is: refugi
The sentinel is <K> and the decoded token is: cheddar
The sentinel is <J> and the decoded token is: unterlieg
The sentinel is <I> and the decoded token is: garanteaz
The sentinel is <H> and the decoded token is: făcute
The sentinel is <G> and the decoded token is: réglage
The sentinel is <F> and the decoded token is: pedepse
The sentinel is <E> and the decoded token is: Germain
The sentinel is <D> and the decoded token is: distinctly
The sentinel is <C> and the decoded token is: Schraub
The sentinel is <B> and the decoded token is: emanat
The sentinel is <A> and the decoded token is: trimestre
The sentinel is <z> and the decoded token is: disrespect
The sentinel is <y> and the decoded token is: Erasmus
The sentinel is <x> and the decoded token is: Australia
The sentinel is <w> and the decoded token is: permeabil
The sentinel is <v> and the decoded token is: deseori
The sentinel is <u> and the decoded token is: manipulated
The sentinel is <t> and the decoded token is: suggér
The sentinel is <s> and the decoded token is: corespund
The sentinel is <r> and the decoded token is: nitro
The sentinel is <q> and the decoded token is: oyons
The sentinel is <p> and the decoded token is: Account
The sentinel is <o> and the decoded token is: échéan
The sentinel is <n> and the decoded token is: laundering
The sentinel is <m> and the decoded token is: genealogy
The sentinel is <l> and the decoded token is: QuickBooks
The sentinel is <k> and the decoded token is: constituted
The sentinel is <j> and the decoded token is: Fertigung
The sentinel is <i> and the decoded token is: goutte
The sentinel is <h> and the decoded token is: regulă
The sentinel is <g> and the decoded token is: overwhelmingly
The sentinel is <f> and the decoded token is: émerg
The sentinel is <e> and the decoded token is: broyeur
The sentinel is <d> and the decoded token is: povești
The sentinel is <c> and the decoded token is: emulator
The sentinel is <b> and the decoded token is: halloween
The sentinel is <a> and the decoded token is: combustibil
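A table like the one above can be printed with a loop over the same hypothetical sentinel_id mapping from the earlier sketch:

# Print each display sentinel alongside the vocabulary word its id shadows.
for sentinel, sid in sentinel_id.items():
    print(f'The sentinel is {sentinel} and the decoded token is: {sp.decode([sid])}')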

Hi @Fares_Bagh

The reason for adding special tokens to a language or translation model arises when we want to mask some tokens during training, for example for a question-answering model. Although the tokenizer can still detect these special tokens and what they originally signified (such as “International”, as you mentioned), the translation model simply works through all the tokens and learns which ones differ from the input tokens, so the accuracy of translating a given input gets better.

Masking also prevents the decoder’s attention mechanism from attending to future tokens during training.
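For reference, the look-ahead mask referred to here is just a lower-triangular matrix: position i may attend only to positions <= i. A minimal NumPy sketch (illustrative, not the assignment’s implementation):

import numpy as np

def causal_mask(n):
    # 1 where attention is allowed, 0 where it is blocked.
    return np.tril(np.ones((n, n), dtype=np.int32))

print(causal_mask(4))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]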

Another significance of masking in a language model is that it leaves room for adding future tokenizer words to the model.

As you work through further assignments, you will understand more about why these special or unused tokens are used in language models.

Regards
DP

Thank you, Deepti. Nice new avatar.

The new avatar is because we doctors of India are standing in support of the doctor who was the victim of the recent rape and murder while on duty in Kolkata.

Keep Learning!!!

Regards
DP

You have my support!!!


And paying close attention, I see the details of the avatar and what it is trying to communicate :frowning:

@Fares_Bagh just to stand up for her a little, I would also add that I’m not sure it is wise to flirt with a Mentor. Even I don’t try to do that.


He wasn’t flirting @Nevermnd :rofl:

Even if it was, it would be of no use.

I don’t mix my personal and professional worlds.

Though thanks for the protection.

Regards
DP