Hello,
In the tokenize_and_mask function in the question answering assignment, sentinel indices start from the upper end of the vocabulary. My question is: what happens if some text actually contains a word whose index is vocab_size - 1 or close to it? It would be treated as a sentinel ID, right?
Hi @parsapico
In the tokenize_and_mask function, sentinel indices starting from the upper end of the vocabulary are designed to avoid conflicts with actual word indices. If your vocabulary is managed correctly, words should not have indices overlapping with the reserved sentinel indices. Ensure that the vocabulary size is set such that there is a clear distinction between actual word indices and sentinel indices to prevent any conflicts.
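As a rough sketch of what such a separation could look like, sentinel IDs can be allocated just above the real vocabulary so they can never match a word index. The toy vocabulary and the helper `make_sentinel_ids` are assumptions for illustration only and are not taken from the assignment code:

```python
def make_sentinel_ids(vocab_size, num_sentinels):
    """Return sentinel IDs placed just above the real vocabulary.

    Hypothetical helper for illustration; not part of the assignment code.
    """
    # Real tokens occupy indices 0 .. vocab_size - 1, so start at vocab_size.
    return [vocab_size + i for i in range(num_sentinels)]


# Toy vocabulary standing in for the real tokenizer (assumption).
toy_vocab = ["<pad>", "the", "cat", "sat", "Internațional"]
sentinel_ids = make_sentinel_ids(len(toy_vocab), num_sentinels=3)

# No sentinel ID is a valid index into the real vocabulary.
assert all(sid >= len(toy_vocab) for sid in sentinel_ids)
print(sentinel_ids)  # [5, 6, 7]
```

With this layout, the model's embedding table would need vocab_size + num_sentinels rows, since the sentinel IDs fall outside the original index range.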
Hope it helps!
I don’t know if you ran the code, but, for example, the first sentinel ID actually corresponds to the word ‘Internațional’ in the vocab, so when we insert a placeholder value we are actually giving it the ID of this word. And nowhere in the code is the vocab size padded to account for sentinel IDs.
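To make the overlap concrete, here is a minimal sketch that lists which real tokens sit at the IDs a top-of-vocabulary sentinel scheme would use. The toy list and the helper `find_sentinel_collisions` are hypothetical; with the real tokenizer you would decode each ID instead of indexing a list:

```python
def find_sentinel_collisions(id_to_token, num_sentinels):
    """Map each top-of-vocabulary sentinel ID to the real token it overlaps.

    Hypothetical helper for illustration; not part of the assignment code.
    """
    vocab_size = len(id_to_token)
    collisions = {}
    for i in range(1, num_sentinels + 1):
        sentinel_id = vocab_size - i  # counts down from the last valid index
        collisions[sentinel_id] = id_to_token[sentinel_id]
    return collisions


# Toy vocabulary standing in for the real tokenizer (assumption).
toy_id_to_token = ["<pad>", "the", "cat", "sat", "Internațional"]
collisions = find_sentinel_collisions(toy_id_to_token, num_sentinels=2)
print(collisions)  # {4: 'Internațional', 3: 'sat'}
```

Every entry in the result is a real token that becomes indistinguishable from a sentinel if it ever appears in the input text.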