While BloombergGPT started with the intention of following the Chinchilla recommendation on the number of tokens relative to model size, they could not obtain that much training data in the finance domain. Why did they not reduce the model size to fit the tokens available for training, as that would give them similar performance at a lower compute cost?
This statement is also not clear: “due to early stopping, the training process terminated after processing 569 billion tokens”. The Llama paper says they saw performance gains continue to a much larger number of tokens (“the performance of a 7B model continues to improve even after 1T tokens”).
Are other LLMs, such as BLOOM and PaLM, also constrained by the unavailability of data and by early stopping, and is that why they were not trained on 20x tokens?
I don’t have any official answers to your questions. I tried to find information about it on the internet, to no avail. I can only share my opinions on your questions:
Given that they didn’t have enough tokens to fit the 20X proposed by Chinchilla, why didn’t they reduce the size of the model to comply with the formula?
My opinion: I think that Chinchilla is a general guideline, not a strict rule, so by continuing with approx. 19X tokens, they were still within the margin of error (so to speak). Had they been at a much lower or higher point, I would question the project in this regard as well, but I think that they were close enough.
Why “early stop” when longer training has been shown to bring better performance? I can only guess that the metrics showed them that it was time to stop. Maybe with more training on the same data they would risk overfitting.
Are there other cases that have faced a lack of data? It is very possible, particularly in cases where the model is being trained for a very specific field. I guess in medicine, for instance, this may be happening to startups that are training big models on that kind of corpus.
@np1 as I said at the beginning, these are only personal opinions.
My understanding is that for a 50B parameter model, they had only 14X tokens (700B), much less than the 20X suggested by Chinchilla. I agree, 19X would have been a good compromise to proceed with. So I wonder whether, if they had reduced the model size to 700/20 = 35B (or even 569/20 ≈ 28B) parameters, the model accuracy would have been similar at a lower compute cost (as predicted by Chinchilla).
Consequently, they constructed a dataset containing just 700 billion tokens, less than the compute-optimal value. Furthermore, due to early stopping, the training process terminated after processing 569 billion tokens.
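For anyone who wants to check the ratios discussed above, here is a minimal sketch of the arithmetic in Python. The input figures (50B parameters, 700B and 569B tokens, 20 tokens per parameter) come from this thread. The function names are my own, and the compute estimate uses the common rule of thumb C ≈ 6·N·D training FLOPs, which is an assumption on my part and not something taken from the BloombergGPT paper.

```python
# Figures quoted in this thread, in billions.
PARAMS_B = 50.0           # BloombergGPT model size
DATASET_TOKENS_B = 700.0  # tokens in the constructed dataset
TRAINED_TOKENS_B = 569.0  # tokens processed before early stopping
CHINCHILLA_RATIO = 20.0   # rule-of-thumb tokens per parameter


def tokens_per_param(tokens_b: float, params_b: float) -> float:
    """Tokens-per-parameter ratio for a given model size and token count."""
    return tokens_b / params_b


def matched_params_b(tokens_b: float, ratio: float = CHINCHILLA_RATIO) -> float:
    """Model size (in billions) that would hit the given ratio for these tokens."""
    return tokens_b / ratio


def approx_train_flops(params_b: float, tokens_b: float) -> float:
    """Rough training compute, ASSUMING the C ~= 6 * N * D rule of thumb."""
    return 6.0 * (params_b * 1e9) * (tokens_b * 1e9)


for label, tokens_b in [("dataset", DATASET_TOKENS_B), ("trained", TRAINED_TOKENS_B)]:
    ratio = tokens_per_param(tokens_b, PARAMS_B)
    smaller = matched_params_b(tokens_b)
    # Relative compute of the smaller model on the same number of tokens.
    saving = approx_train_flops(smaller, tokens_b) / approx_train_flops(PARAMS_B, tokens_b)
    print(f"{label}: {ratio:.2f} tokens/param at 50B; "
          f"20X would need ~{smaller:.2f}B params "
          f"(~{saving:.0%} of the 50B compute)")
```

This only reproduces the ratios; it says nothing about whether the smaller model would actually reach similar accuracy, which is the open question here.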
Thanks for the reply @np1. I think this will be hard to know. Maybe yes: according to Chinchilla, reducing the number of params to approach the ratio could have had a good impact, but maybe the team had reasons to believe that keeping it as it was would be more beneficial. There are many moving parts in these projects, and this ratio is just one of them.
“Introducing BloombergGPT, Bloomberg’s 50-billion parameter large language model, purpose-built from scratch for finance”
Given that the model is not publicly accessible, it could be in the interest of its authors to keep the model size unnecessarily high (comparable to the size of other LLMs) for marketing purposes. A large parameter count conveys the model’s “sophistication” and helps justify the subscription fees charged.