Addressing variance with more data and the representative power of the data at hand


Suppose I have done all I can with parameter adjustments (regularization, degree of the polynomial) and I still observe that my linear regression model has high variance.
One solution Andrew proposes is to feed more data into the training process. Finding more data is an obvious solution, but what about changing the split ratio?

What if, instead of a 60-20-20 split for training, CV, and test data, I re-split with a ratio such as 80-10-10 and expect that to resolve the overfitting?
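Concretely, the re-split I have in mind would look something like this. This is just a sketch with NumPy: the fractions, the dataset size, and the `split_indices` helper are all hypothetical, not anything from the course.

```python
import numpy as np

def split_indices(n, train_frac=0.8, cv_frac=0.1, seed=0):
    """Shuffle indices and cut them into train / CV / test portions.

    train_frac and cv_frac are illustrative fractions; the remainder
    goes to the test set (so 80-10-10 here instead of 60-20-20).
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(n * train_frac)
    n_cv = int(n * cv_frac)
    return idx[:n_train], idx[n_train:n_train + n_cv], idx[n_train + n_cv:]

# With 1000 samples this yields 800 / 100 / 100 instead of 600 / 200 / 200.
train_idx, cv_idx, test_idx = split_indices(1000)
print(len(train_idx), len(cv_idx), len(test_idx))  # 800 100 100
```

The point being: the model would see 200 extra training samples, but the CV and test sets shrink by the same amount.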

Would this be cheating :slight_smile: ? Or is it a good way to address the problem?
One issue I see with this approach is the representative power of the CV and training sets for the real world. After all, we use the training and CV splits to establish a model that represents real-world data, not just the training data.

So maybe the new split resolves the overfitting in theory, but in practice my model still wouldn't be a good one for the real world, since the training sample is not representative.
How do we know that even the overall dataset is big enough to be representative of real-world scenarios? Maybe the data at hand is full of samples with specific features, such as houses from a very high-income neighbourhood. A "just right" model fit to this data would not be just right even if the CV and training errors are well below the baseline.

I guess this is an issue Andrew doesn't cover as of Course 2, Week 3. Maybe he will in the other courses… will he?

Anyway… I wonder what you all think about this discussion.

warmly, Mehmet


Hi Mehmet,

That’s a good question. Re-splitting the data 80-10-10 is a common way to reduce overfitting, but whether it helps also depends on how large your dataset is. To ensure the test set is representative of the real world, you also need to look into feature engineering and other properties of the data.

For example, how balanced is the distribution of feature values in your test set? Take the housing-price dataset: if your training set contains a larger share of high-income households than the test set does, the model will likely predict better on that type of data and will not be practical in real life. One useful method here is stratified sampling.
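A minimal sketch of stratified sampling, using only NumPy: the `stratified_split` helper and the income-bracket labels are hypothetical, invented for the housing example above. The idea is simply to split each class separately so both partitions keep the original class proportions.

```python
import numpy as np
from collections import defaultdict

def stratified_split(labels, test_frac=0.2, seed=0):
    """Split sample indices so each class keeps its proportion in both parts.

    `labels` is a per-sample category (e.g. an income bracket for the
    housing example); any hashable label works.
    """
    rng = np.random.default_rng(seed)
    by_label = defaultdict(list)
    for i, y in enumerate(labels):
        by_label[y].append(i)
    train, test = [], []
    for idx in by_label.values():
        idx = rng.permutation(idx)          # shuffle within the class
        n_test = int(round(len(idx) * test_frac))
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return train, test

# 90 'low' vs 10 'high' income samples: both splits keep the 9:1 ratio,
# so the test set gets exactly 2 of the 10 'high' samples.
labels = ['low'] * 90 + ['high'] * 10
train, test = stratified_split(labels)
print(sum(labels[i] == 'high' for i in test), len(test))  # 2 20
```

In practice, `sklearn.model_selection.train_test_split` offers the same behavior via its `stratify` parameter.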

Another hidden issue is data leakage: make sure you select only features that are available before the time of prediction.
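To add an illustration of a related form of leakage (the point above is about features unavailable at prediction time; this sketch shows preprocessing leakage instead): any statistics used to transform the data, such as a scaler's mean and standard deviation, should be fit on the training split only. The synthetic data here is made up purely for demonstration.

```python
import numpy as np

# Illustrative synthetic data: 100 samples of one numeric feature.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=100)
train, test = X[:80], X[80:]

# Wrong: mean/std computed on the FULL array, so test-set information
# leaks into the preprocessing step.
leaky = (test - X.mean()) / X.std()

# Right: fit the scaler on the training split only, then apply it
# to the test split.
clean = (test - train.mean()) / train.std()

# The two normalizations differ, because train and full-data statistics differ.
print(np.allclose(leaky, clean))
```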

There are many resources about hands-on machine learning that you can look through and see how people develop models for real-life projects.

Hope that helps!


These are all wonderful resources to look at, Nguyen (if I may).
Thank you for taking the time to provide them. I will read them!
