In the section "Packing the data" (Lesson 4, packaging-data-for-pretraining), we reshape the tokenized inputs using max_seq_length, and we discard the extra tokens from the end of the list so the total number of tokens is exactly divisible by max_seq_length.
Then the last example won't have an EOS token, right? (if the total number of tokens is not a multiple of max_seq_length)
Won't that mislead the model during training?
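Roughly, the packing step I'm describing looks like this (a minimal sketch with toy token ids; the names and values are illustrative, not the lesson's exact code):

```python
import numpy as np

max_seq_length = 8
eos_token_id = 2  # illustrative value

# Toy stand-in for the tokenized dataset, with EOS appended to each example.
tokenized_examples = [
    [5, 6, 7, eos_token_id],
    [9, 10, 11, 12, 13, eos_token_id],
    [4, 4, 4, eos_token_id],
]

# Concatenate everything into one long stream of token ids.
all_token_ids = np.concatenate([np.array(ids) for ids in tokenized_examples])

# Drop the trailing remainder so the length is an exact multiple of
# max_seq_length, then reshape into rows of max_seq_length tokens each.
total_length = (len(all_token_ids) // max_seq_length) * max_seq_length
packed = all_token_ids[:total_length].reshape(-1, max_seq_length)
print(packed)  # the last kept row ends mid-example, without an EOS
```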
What do you mean by this? If you are saying that the EOS falls outside the max sequence length, then we add an if statement that checks whether the output was the [EOS] token and breaks the generation loop.
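Something like this in the generation loop (a minimal sketch, assuming a HuggingFace-style causal LM whose forward pass returns `.logits`; names are illustrative):

```python
import torch

def generate(model, input_ids, eos_token_id, max_new_tokens=50):
    """Greedy decoding that stops at the length limit or at [EOS]."""
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(input_ids).logits  # (1, seq_len, vocab_size)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        # Break out of the generation loop once the model emits [EOS].
        if next_token.item() == eos_token_id:
            break
    return input_ids
```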
I was referring to the training data creation part (not generation).
If we follow this, the last packed example won't have an EOS token. Won't that mislead the model during training?
That was my doubt.
If you notice, the packed pretraining data is built from the dataset dictionary, which contains the tokenized sentences as input_ids. When the trained model is used, it runs with a max token/sequence length plus the if statement for EOS, so generation stops either when the length limit is reached or when an EOS appears; the missing EOS at the very end of the packed stream doesn't prevent the model from training successfully.
The step in the image you shared only explains how the tokenized sentences are reshaped; it doesn't mean the model isn't trained on a dataset with defined BOS and EOS tokens.
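To make that concrete: the EOS tokens inserted between documents are still present inside the packed rows; only the trailing remainder (fewer than max_seq_length tokens) is dropped. A quick toy illustration (values are made up, not the lesson's code):

```python
import numpy as np

eos_token_id = 2  # illustrative value
max_seq_length = 4

# Token stream with an EOS after each document (10 tokens total).
stream = np.array([5, 6, eos_token_id, 7, 8, 9, eos_token_id, 3, 4, eos_token_id])

total_length = (len(stream) // max_seq_length) * max_seq_length
packed = stream[:total_length].reshape(-1, max_seq_length)
print(packed)
# [[5 6 2 7]
#  [8 9 2 3]]
# Only the trailing remainder (here the token 4 and the final EOS) was
# dropped; the other EOS boundaries still appear inside the training rows.
```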