Hi guys,
I have a question about the "Scaling Laws and Compute-Optimal Models" lecture from Week 1.
In this lecture, it was mentioned that the optimal number of training tokens, according to the Chinchilla research paper, is roughly 20 times the number of model parameters. What I would like to understand is whether this applies only to pre-training (i.e., training a model from scratch), or whether it is also applicable to fine-tuning. Sorry if this is a baseless question; I am just trying to understand how I can leverage the Chinchilla findings for a translation model that I am building.
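To make sure I'm applying the rule the way I think it works, here is the back-of-the-envelope calculation I had in mind. This is just a sketch of the 20x heuristic; the 700M parameter count is a hypothetical example, not my actual model size:

```python
# Chinchilla-style heuristic: compute-optimal pre-training tokens ~= 20 x model parameters.
def chinchilla_optimal_tokens(num_params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate compute-optimal number of training tokens per the Chinchilla heuristic."""
    return tokens_per_param * num_params

params = 700e6  # hypothetical 700M-parameter translation model (example value only)
print(f"~{chinchilla_optimal_tokens(params) / 1e9:.0f}B tokens")  # -> ~14B tokens
```

Is this kind of estimate only meaningful when budgeting a from-scratch pre-training run, or does it tell me anything about how much data I should gather for fine-tuning?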
Hoping to get a response.
Thanks!!