Augment pre-training

I want to add new tokens to pre-training, since I have some words that might not be in the original pre-training corpus T5 was trained on.

Is it possible to do incremental pre-training?

Or do I have to pre-train T5 from scratch?

Thanks
Harish

Yes, it is possible to add new tokens to T5’s pre-training by fine-tuning the model on additional data that includes the new tokens. This process is called incremental pre-training or domain adaptation.

You can fine-tune the T5 model on your specific task and include the new tokens in the training data. This will allow the model to learn to generate outputs that include the new tokens.
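For example, a minimal sketch using the Hugging Face transformers library might look like the following. The token list here is purely illustrative; replace it with the domain words you actually need. `add_tokens()` and `resize_token_embeddings()` are the standard calls for this, and the new rows of the embedding matrix start out randomly initialized until they are trained:

```python
from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Hypothetical domain-specific words; use your own list here.
new_tokens = ["acetylcholinesterase", "bronchodilator"]

# add_tokens() skips tokens already in the vocabulary and returns
# the number of tokens actually added.
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix so the new ids get (randomly initialized) rows.
model.resize_token_embeddings(len(tokenizer))
```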

However, if the new tokens represent a significantly different domain or language than what T5 was originally trained on, you may need to pre-train T5 from scratch on your new data. In this case, you would train the model on a large corpus of text that includes your new tokens, and then fine-tune the resulting model on your specific task.
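If you do continue pre-training on a domain corpus, a highly simplified sketch could look like the one below. It builds on the `tokenizer` and `model` from the snippet above, uses a hypothetical `corpus` list, and replaces T5's real span-corruption objective with a toy single-span mask just to illustrate the training loop:

```python
import torch

# Hypothetical raw domain sentences containing the new tokens.
corpus = [
    "Acetylcholinesterase inhibitors are used to treat certain conditions.",
    "A bronchodilator relaxes the muscles in the airways.",
]

def make_denoising_pair(text):
    # Extremely simplified stand-in for T5's span-corruption objective:
    # mask the second word with the first sentinel token and train the
    # model to reconstruct it. A real setup corrupts ~15% of tokens
    # in randomly chosen spans.
    words = text.split()
    target = words[1]
    words[1] = "<extra_id_0>"
    return " ".join(words), f"<extra_id_0> {target} <extra_id_1>"

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model.train()

for epoch in range(3):
    for text in corpus:
        src, tgt = make_denoising_pair(text)
        batch = tokenizer(src, return_tensors="pt")
        labels = tokenizer(tgt, return_tensors="pt").input_ids
        loss = model(**batch, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

After this continued pre-training step, you would fine-tune the resulting checkpoint on your downstream task as usual.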

Atharva,

Thanks for the response. Do you have any notebooks to share that have the code for incremental pre-training?

Thanks
Harish

I’m sorry, but I don’t have any. You can probably find some tutorials on GitHub.

A follow-up question - if you fine-tune an LLM on additional data that includes new tokens, are these new tokens added to the vocabulary of the LLM’s tokenizer? How are embeddings generated for these previously unseen tokens?