How to handle new tokens

Objective: I would like to fine-tune a model using one of my chat-app group chats as a dataset. The group chat has multiple members, and my goal is to train the model to predict a specific member's next message given the context of the previous messages.

Challenges: There is a unique challenge here. The group messages are a mix of English and Telugu, with the Telugu words typed in the English (Latin) script. As a result, some of these romanized Telugu words may not appear in the target tokenizer's vocabulary.

Question: How can I handle these new tokens during fine-tuning so that the model still makes accurate predictions despite these language variations?
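
For reference, subword tokenizers will usually split unseen romanized words into smaller pieces rather than fail outright, but frequent chat-specific words can also be registered as whole tokens. Below is a minimal sketch of that approach using the Hugging Face transformers API; the model name and word list are placeholder assumptions, not from the actual dataset:

```python
# Sketch: extend the tokenizer's vocabulary with romanized Telugu words
# and resize the model's embedding matrix to match.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical romanized Telugu words collected from the group chat.
new_words = ["bagunnava", "enti", "cheppu"]

# add_tokens returns how many of the words were actually new to the vocab.
num_added = tokenizer.add_tokens(new_words)
print(f"Added {num_added} new tokens")

# The embedding matrix must grow to cover the enlarged vocabulary; the new
# rows are randomly initialized and get learned during fine-tuning.
model.resize_token_embeddings(len(tokenizer))
```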

Can you pre-process the Telugu entries, translating them to English, before feeding them to the model?
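
A rough sketch of what that preprocessing step could look like, with the caveat that off-the-shelf Telugu-to-English models generally expect Telugu script, so romanized (Latin-script) Telugu may need transliteration first. The choice of Helsinki-NLP/opus-mt-mul-en here is an assumption, not something verified in this thread:

```python
# Sketch: translate each chat message to English before tokenization.
from transformers import pipeline

# Many-to-English translation model (assumed suitable for this data).
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-mul-en")

def preprocess(message: str) -> str:
    """Translate a mixed-language chat message into English."""
    return translator(message)[0]["translation_text"]

print(preprocess("example mixed-language message"))
```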

Yes, I can. So would that be the best solution?

I don’t know if this is the best solution, but it is certainly one valid solution 🙂