On the 'Scaling Laws and Compute-Optimal Models' lecture

In the lecture 'Scaling Laws and Compute-Optimal Models', it was mentioned that researchers found many 100B+ models are 'overly parameterized' and could benefit from more training data. I think that is also the main point of the Llama paper. But I always thought over-parameterization is the reason LLMs perform so well, i.e., they can generalize and memorize at the same time. That intuition seems to contradict the findings.

Another way to think about this is in terms of the classic over-fitting/under-fitting balance. If you have too many parameters, your model tends to overfit the available data, so some erratic behavior is to be expected. The goal is to keep the data and model size in balance, and this is what the Chinchilla law says. Of course, small deviations from the optimal balance are OK in many practical contexts, but large deviations are not.
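
For concreteness, here is a minimal back-of-the-envelope sketch of that balance, assuming the commonly cited Chinchilla approximations (training compute C ≈ 6·N·D FLOPs and roughly 20 training tokens per parameter). The exact fitted coefficients in the paper differ a bit, so treat the numbers as rough:

```python
# Rough Chinchilla-style back-of-the-envelope calculation.
# Assumptions: compute C ~= 6 * N * D FLOPs and a compute-optimal ratio of
# about 20 training tokens per parameter (commonly cited approximations,
# not the exact fitted constants from the paper).

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Return (params N, tokens D) that balance a given compute budget."""
    # With C = 6 * N * D and D = tokens_per_param * N:
    #   C = 6 * tokens_per_param * N**2  =>  N = sqrt(C / (6 * tokens_per_param))
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    for c in (1e21, 1e23, 1e25):
        n, d = chinchilla_optimal(c)
        print(f"C={c:.0e} FLOPs -> ~{n/1e9:.1f}B params, ~{d/1e9:.0f}B tokens")
```

At roughly 6e23 FLOPs this works out to about 70B parameters and 1.4T tokens, which is approximately the Chinchilla setting itself; a 100B+ model trained on only a few hundred billion tokens sits well off this curve, which is the sense in which it is 'overly parameterized'.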

As far as I'm aware, the benefit of overparameterization relies on the number of parameters exceeding the number of data points, which contradicts the Chinchilla rule. I don't think the classical over-fitting/under-fitting intuition works here.
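
As a quick sanity check on that (a sketch using rough, commonly cited token counts, not figures from the lecture): even heavily parameterized LLMs are still trained on more tokens than they have parameters, just far fewer tokens per parameter than the ~20x the Chinchilla fit suggests, so the interpolation-regime condition behind the classic overparameterization results (parameters ≫ training examples) doesn't really hold here.

```python
# Which regime is a model in? Classic overparameterization arguments assume
# parameters >> training examples (interpolation regime), while Chinchilla-
# optimal training puts tokens ~20x above parameters. Token counts below are
# rough, commonly cited figures (assumptions, not numbers from the lecture).

models = {
    "GPT-3 175B": (175e9, 300e9),      # ~300B training tokens (commonly cited)
    "Chinchilla 70B": (70e9, 1.4e12),  # ~1.4T training tokens (paper's setting)
}

for name, (n_params, n_tokens) in models.items():
    chinchilla_tokens = 20.0 * n_params
    print(f"{name}: tokens/params = {n_tokens / n_params:.1f} "
          f"(Chinchilla-optimal would be ~{chinchilla_tokens / 1e12:.1f}T tokens)")
```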

A quick reference: https://www.youtube.com/watch?v=R29awq6jvUw