In the “Scaling laws and compute-optimal models” video, the instructor says:
“The relationship here holds as long as model size and training dataset size don’t inhibit the training process. Taken at face value, this would suggest that you can just increase your compute budget to achieve better model performance.”
As I read it, this would apply even when the data and the model parameters are fixed. Having trained ML models in the past (not LLMs), I’m failing to understand how the exact same data and the exact same model (parameters) can perform better just by throwing more GPUs at the problem.
Could it be that, since the model is trained with self-supervised learning, a larger compute budget just means more training passes over the same dataset?
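
To make my guess concrete, here is a minimal sketch of what I mean. It assumes the commonly quoted rule of thumb that training compute is roughly 6 × parameters × tokens processed for a dense transformer; that approximation is my assumption and is not stated in the video. The function names and all the numbers are hypothetical, chosen only for illustration.

```python
# Rough sketch: training compute ~ 6 * parameters * tokens processed.
# This rule of thumb is an assumption on my part, not taken from the video.

def tokens_processed(compute_flops: float, n_params: float) -> float:
    """Tokens the model can be trained on for a given compute budget (FLOPs)."""
    return compute_flops / (6.0 * n_params)


def epochs_over_dataset(compute_flops: float, n_params: float,
                        dataset_tokens: float) -> float:
    """Number of passes over a fixed dataset that the budget pays for."""
    return tokens_processed(compute_flops, n_params) / dataset_tokens


# Hypothetical model and dataset sizes, held fixed throughout.
n_params = 1e9          # 1B parameters
dataset_tokens = 20e9   # 20B tokens

# Budget that pays for exactly one pass over the dataset.
base_budget = 6.0 * n_params * dataset_tokens

for multiplier in (1, 2, 4):
    epochs = epochs_over_dataset(multiplier * base_budget, n_params, dataset_tokens)
    print(f"{multiplier}x budget -> {epochs:.1f} passes over the same data")
```

Under that assumption, with the model and the dataset held fixed, the only thing extra compute can buy is more passes over the same tokens, which is the reading I’m asking about.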