How exactly does adding more training examples help reduce overfitting?

By adding new data without increasing the polynomial degree of the model, the regression line (the prediction line) becomes smoother, because the model trains on the extra data and the parameters (weights) adjust accordingly.

Hi @Thala, let’s illustrate this! Below, from top to bottom, I plotted 3 types of dataset: (1) noise-free data; (2) noisy data; and (3) a larger amount of noisy data.

In (1), noise-free data: the simplest model we need for this data is y = w_1x + b, but even if we over-parameterize the model, e.g. y = w_1x + w_2x^2 + w_3x^3 + b, we still won’t overfit. After training we get w_2 \approx w_3 \approx 0, which is equivalent to the simplest model.
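You can check this yourself with a quick sketch (my own illustration, not part of the plots above; the true model y = 2x + 1 is an assumed example): fit an over-parameterized cubic to noise-free linear data and look at the higher-order weights.

```python
# Fit a cubic model to noise-free linear data: the extra weights vanish.
import numpy as np

x = np.linspace(-1, 1, 20)
y = 2 * x + 1                     # noise-free data from the line y = 2x + 1

# np.polyfit returns coefficients highest degree first: [w3, w2, w1, b]
w3, w2, w1, b = np.polyfit(x, y, deg=3)

print(w3, w2, w1, b)              # w3 and w2 come out numerically ~0
```

Since the data lies exactly on a line, the least-squares fit has zero residual with w_2 = w_3 = 0, so the cubic collapses to the simplest model.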

In (2), noisy data: now our over-parameterized model y = w_1x + w_2x^2 + w_3x^3 + b does its best to fit the data, including the points that deviate strongly, resulting in a curve. Such a curve is unwanted because the true underlying model is a line (the dashed line); the noise bends the fit into a curve, and therefore **we are fitting the model to the noise**! Here, we overfit the data.

In (3), more noisy data: this time, our over-parameterized model y = w_1x + w_2x^2 + w_3x^3 + b looks better and closer to a line. Why? Since the noise is random, with more data it’s more likely that there are data points which **“balance”** the highly deviated ones.

For example, whenever the model wants to bend away to the left because there is a strongly deviated point on the **left-hand side** of the dashed line (our true model), it also sees another strongly deviated point **on the right-hand side**, so it can’t bend too far to the left: it has to take care of the point on the right as well and stay somewhere in the middle, close to the dashed line, which is our true model! Here, the model fits the noise much less, and overfitting is reduced!
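The balancing effect can also be seen numerically. Below is a small sketch (my own setup, with an assumed true model y = 2x + 1 and Gaussian noise): fit the same cubic model to a small vs. a large noisy dataset and measure how far the fitted curve bends away from the true line, averaged over several random seeds so the comparison is stable.

```python
# Compare how far an over-parameterized cubic fit drifts from the true
# line when trained on few vs. many noisy points.
import numpy as np

def fit_deviation(n, seed, sigma=0.5):
    """Mean squared gap between the cubic fit and the true line y = 2x + 1."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1, 1, n)
    y = 2 * x + 1 + rng.normal(0, sigma, n)   # noisy samples of the line
    coefs = np.polyfit(x, y, deg=3)           # over-parameterized cubic fit
    grid = np.linspace(-1, 1, 200)
    true_line = 2 * grid + 1
    return np.mean((np.polyval(coefs, grid) - true_line) ** 2)

# Average over 20 seeds: with 10 points the curve wiggles toward the noise,
# with 1000 points it stays close to the true line.
dev_small = np.mean([fit_deviation(10, s) for s in range(20)])
dev_large = np.mean([fit_deviation(1000, s) for s in range(20)])

print(dev_small, dev_large)
```

With more data the random deviations cancel out, so the cubic has nothing left to chase and settles near the line, which is exactly the reduction in overfitting described above.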

Cheers!

Raymond