Augmenting pre-training with new tokens

I want to add new tokens to pre-training, since I have some words that might not be in the original corpus T5 was pre-trained on.

Is it possible to do incremental pre-training?

Or do I have to pre-train T5 from scratch?


Yes, it is possible: you can add the new tokens to T5’s tokenizer and then continue training the model on additional data that includes them. This process is called incremental pre-training or domain adaptation.

You can fine-tune the T5 model on your specific task and include the new tokens in the training data. This will allow the model to learn to generate outputs that include the new tokens.

However, if the new tokens represent a significantly different domain or language than what T5 was originally trained on, you may need to pre-train T5 from scratch on your new data. In this case, you would train the model on a large corpus of text that includes your new tokens, and then fine-tune the resulting model on your specific task.
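To make the first option concrete: with a library like Hugging Face transformers the usual recipe is to call `tokenizer.add_tokens(...)` and then `model.resize_token_embeddings(len(tokenizer))` before continuing training. The NumPy sketch below is only an illustration of what that resize does under the hood (extend the vocab, append new embedding rows initialized from the mean of the existing ones); the function and variable names are mine, not part of any library.

```python
import numpy as np

def add_tokens(vocab, embeddings, new_tokens, rng=None):
    """Extend a token->id vocab and its embedding matrix with new tokens.

    New rows are initialized to the mean of the existing embeddings plus
    small noise -- a common heuristic, loosely mirroring what
    model.resize_token_embeddings() does after tokenizer.add_tokens().
    """
    rng = rng or np.random.default_rng(0)
    # Skip tokens that are already in the vocabulary.
    added = [t for t in new_tokens if t not in vocab]
    for tok in added:
        vocab[tok] = len(vocab)
    if added:
        mean = embeddings.mean(axis=0)
        noise = rng.normal(scale=0.02, size=(len(added), embeddings.shape[1]))
        embeddings = np.vstack([embeddings, mean + noise])
    return vocab, embeddings

# Toy example: a 3-token vocab with 4-dimensional embeddings.
vocab = {"the": 0, "cat": 1, "sat": 2}
emb = np.zeros((3, 4))
vocab, emb = add_tokens(vocab, emb, ["transformer", "cat"])
print(len(vocab), emb.shape)  # 4 (4, 4) -- "cat" was already present
```

After the resize, the new rows are essentially random, so the model must see the new tokens in training data (your incremental pre-training step) before they carry useful meaning.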


Thanks for the response. Do you have any notebooks to share that have the code for incremental pre-training?


I’m sorry, but I don’t have any. You can likely find some tutorials on GitHub.

A follow-up question - if you fine-tune an LLM on additional data that includes new tokens, are these new tokens added to the vocabulary of the LLM’s tokenizer? How are embeddings generated for these previously unseen tokens?