Optimal Number of Tokens

Hi guys,

This question is about the week 1 lecture, Scaling laws and compute-optimal models.

In this lecture, it was mentioned that “The optimal number of tokens required/preferred according to the Chinchilla research paper is 20 times the model parameters.” What I would like to understand is whether this applies only to pre-training, where you train a model from scratch, or whether it also applies to fine-tuning. Sorry if this is a baseless question. I am just trying to understand how I can leverage the Chinchilla research for a translation model that I am building.

Hoping to get a response.



This refers to the training phase of the model itself, before it is deployed to make predictions!
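To make the rule quoted in the question concrete, here is a minimal sketch of the arithmetic, assuming the roughly 20 tokens per parameter figure from the lecture. The function name `chinchilla_optimal_tokens` and the example model sizes are made up for illustration, and the result is only a rule of thumb for pre-training, not a statement about fine-tuning.

```python
def chinchilla_optimal_tokens(n_params: int, tokens_per_param: int = 20) -> int:
    """Rough compute-optimal number of pre-training tokens.

    Assumes the ~20 tokens per parameter rule of thumb quoted from the lecture.
    """
    return n_params * tokens_per_param


# Hypothetical model sizes, chosen only to illustrate the arithmetic.
example_sizes = [
    ("1B", 1_000_000_000),
    ("7B", 7_000_000_000),
    ("70B", 70_000_000_000),
]

for name, n_params in example_sizes:
    tokens = chinchilla_optimal_tokens(n_params)
    print(f"{name} parameters -> about {tokens / 1e9:.0f}B training tokens")
```

So a 70B-parameter model would want on the order of 1.4 trillion pre-training tokens under this rule.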


Thank you @gent.spah


Do you have any recommendations for a translation model? I would like to take an existing model and fine-tune it on internal translation data. @gent.spah


Not really, but check on the web or TensorFlow Hub!

@gent.spah Sure, thank you. One other question: could you please tell me what sort of cleaning techniques I should apply to translation data? Or, if you could point me to a resource, that would be really helpful. I tried searching online but couldn’t find anything reliable.
