In the section "Packing the data" (Lesson 4, packaging-data-for-pretraining), we reshape the tokenized inputs using max_seq_length, and we discard the extra tokens from the end of the list so the total number of tokens is exactly divisible by max_seq_length.
Then the last example won't have an EOS token, right? (if the total number of tokens is not a multiple of max_seq_length)
Won't that mislead the model during training?
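Roughly, the packing step I'm describing looks like this (a minimal sketch with toy token ids; the names and values are illustrative, not the lesson's exact code):

```python
import numpy as np

max_seq_length = 8
eos_token_id = 2  # illustrative value

# Toy stand-in for the tokenized dataset, with EOS appended to each example.
tokenized_examples = [
    [5, 6, 7, eos_token_id],
    [9, 10, 11, 12, 13, eos_token_id],
    [4, 4, 4, eos_token_id],
]

# Concatenate everything into one long stream of token ids.
all_token_ids = np.concatenate([np.array(ids) for ids in tokenized_examples])

# Drop the trailing remainder so the length is an exact multiple of
# max_seq_length, then reshape into rows of max_seq_length tokens each.
total_length = (len(all_token_ids) // max_seq_length) * max_seq_length
packed = all_token_ids[:total_length].reshape(-1, max_seq_length)
print(packed)  # the last kept row ends mid-example, without an EOS
```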
What do you mean by this? If you are saying that the EOS falls outside the max sequence length, then we add an if statement that checks whether the output was the [EOS] token and breaks the generation loop.
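Something like this in the generation loop (a minimal sketch, assuming a HuggingFace-style causal LM whose forward pass returns `.logits`; names are illustrative):

```python
import torch

def generate(model, input_ids, eos_token_id, max_new_tokens=50):
    """Greedy decoding that stops at the length limit or at [EOS]."""
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(input_ids).logits  # (1, seq_len, vocab_size)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        # Break out of the generation loop once the model emits [EOS].
        if next_token.item() == eos_token_id:
            break
    return input_ids
```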
I was referring to the training data creation part (not generation).
If we follow this, the last packed example won't have an EOS token. Won't that mislead the model during training?
That was my doubt.
If you notice, the packed pretraining data is built from the dataset dictionary, which contains the tokenized sentences as input_ids. When the trained model is used, it runs with a max token/sequence length plus the if statement for EOS, so generation stops either when the length limit is reached or when an EOS appears; the missing EOS at the very end of the packed stream doesn't prevent the model from training successfully.
The step in the image you shared only explains how the tokenized sentences are reshaped; it doesn't mean the model isn't trained on a dataset with defined BOS and EOS tokens.
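To make that concrete: the EOS tokens inserted between documents are still present inside the packed rows; only the trailing remainder (fewer than max_seq_length tokens) is dropped. A quick toy illustration (values are made up, not the lesson's code):

```python
import numpy as np

eos_token_id = 2  # illustrative value
max_seq_length = 4

# Token stream with an EOS after each document (10 tokens total).
stream = np.array([5, 6, eos_token_id, 7, 8, 9, eos_token_id, 3, 4, eos_token_id])

total_length = (len(stream) // max_seq_length) * max_seq_length
packed = stream[:total_length].reshape(-1, max_seq_length)
print(packed)
# [[5 6 2 7]
#  [8 9 2 3]]
# Only the trailing remainder (here the token 4 and the final EOS) was
# dropped; the other EOS boundaries still appear inside the training rows.
```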