Does BloombergGPT contradict Chinchilla and Llama papers?

np1 · July 3, 2023, 11:28pm

BloombergGPT started with the intention of following the Chinchilla recommendation for the number of training tokens relative to model size, but they could not obtain that much training data in the finance domain. Why did they not reduce the model size to fit the available tokens, since that would give them similar performance at a lower compute cost?

Also, this is not clear: “due to early stopping, the training process terminated after processing 569 billion tokens”. The Llama paper says they saw performance gains continue to a much larger number of tokens (“the performance of a 7B model continues to improve even after 1T tokens”).

Are other LLMs such as BLOOM and PaLM also held back by unavailability of data and early stopping, and is that why they were not trained on 20x tokens?

Thanks.

Juan_Olano · July 4, 2023, 12:23am

I don’t have any official answers to your questions. I tried to find information on the internet about it, to no avail. I can only share my opinions on your questions:

Given that they didn’t have enough tokens to reach the 20X proposed by Chinchilla, why didn’t they reduce the size of the model to comply with the formula?
My opinion: I think that Chinchilla is a general guideline, not a strict rule, so by continuing with approx. 19X tokens, they were still within the margin of error (so to speak). Had they been at a much lower or higher point, I would question the project in this regard as well, but I think that they were close enough.

Why “early stop” when longer training has been shown to bring better performance? I can only guess that the metrics showed them that it was time to stop. Maybe with more training on the same data they would have risked overfitting.

Are there other cases that have faced a lack of data? It is very possible, particularly where the model is being trained in a very specific field. I guess in medicine, for instance, this may be happening to startups that are training big models on a domain-specific corpus.

@np1 as I said at the beginning, these are only personal opinions.

np1 · July 6, 2023, 8:55pm

Thanks, @Juan_Olano…

My understanding is that for a 50B parameter model, they had only 14X tokens (700B), much less than the 20X suggested by Chinchilla. I agree, 19X would have been a good compromise to proceed with. So I wonder: if they had reduced the model size to 700/20 = 35B (or even 569/20 ≈ 28B) parameters, would the model accuracy be similar at a lower compute cost (as predicted by Chinchilla)?

Consequently, they constructed a dataset containing just 700 billion tokens, less than the compute-optimal value. Furthermore, due to early stopping, the training process terminated after processing 569 billion tokens.
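
The arithmetic in this post can be checked with a small sketch. It uses only the figures quoted in this thread (about 50B parameters, a 700B-token dataset, 569B tokens actually processed, and the roughly 20 tokens-per-parameter rule of thumb). The function names are hypothetical, and the compute estimate relies on the common approximation C ≈ 6·N·D, which is an assumption brought in for illustration and is not stated anywhere in the thread.

```python
# Rule-of-thumb ratio discussed in the thread: ~20 training tokens per parameter.
TOKENS_PER_PARAM = 20


def tokens_per_param(params_b: float, tokens_b: float) -> float:
    """Tokens-per-parameter ratio; both inputs are in billions."""
    return tokens_b / params_b


def compute_optimal_params(tokens_b: float, ratio: float = TOKENS_PER_PARAM) -> float:
    """Model size (billions of parameters) that a token budget supports at `ratio`."""
    return tokens_b / ratio


def approx_train_flops(params_b: float, tokens_b: float) -> float:
    """Rough training compute via C ~= 6 * N * D (an assumed approximation)."""
    return 6 * (params_b * 1e9) * (tokens_b * 1e9)


if __name__ == "__main__":
    model_b = 50  # model size quoted in the thread, in billions of parameters
    for label, tokens_b in [("dataset size", 700), ("tokens processed", 569)]:
        ratio = tokens_per_param(model_b, tokens_b)
        smaller_b = compute_optimal_params(tokens_b)
        saving = 1 - approx_train_flops(smaller_b, tokens_b) / approx_train_flops(model_b, tokens_b)
        print(f"{label}: {tokens_b}B tokens -> {ratio:.1f}X for a {model_b}B model; "
              f"a 20X model would have ~{smaller_b:.1f}B parameters "
              f"(~{saving:.0%} less compute under the 6ND assumption)")
```

Under these assumptions the ratio comes out at 14X for the full 700B-token dataset and about 11.4X for the 569B tokens actually processed. Whether a 35B or 28B model would really have matched the 50B model's accuracy is exactly the open question here; the sketch only shows what the rule of thumb implies.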

Juan_Olano · July 6, 2023, 11:06pm

Thanks for the reply, @np1. I think this will be hard to know. Maybe yes: according to Chinchilla, reducing the number of parameters to approach the ratio could have had a good impact, but maybe the team had reasons to believe that keeping it as it was would be more beneficial. There are many moving parts in these projects, and this ratio is just one of them.

Kasowari · July 7, 2023, 4:28pm

“Introducing BloombergGPT, Bloomberg’s 50-billion parameter large language model, purpose-built from scratch for finance”

Given that the model is not publicly accessible, it could be in the interest of its authors to keep the model size unnecessarily high (comparable to the size of other LLMs) for marketing purposes. This would convey the model’s “sophistication” and help justify the subscription fees charged.
