hi,

suppose I have done all I can with parameter adjustments (regularization, degree of the polynomial) and I still observe that my linear regression model has high variance.

so one solution Andrew proposes is to feed more data into the training process. Finding more data is the obvious solution, but what about changing the split ratio?

what if, instead of a 60-20-20 split for training, CV, and test data, I re-split with a ratio such as 80-10-10 and expect that to resolve the overfitting?
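For concreteness, here is a minimal sketch of what that re-split actually changes, assuming a hypothetical dataset of 1000 samples (plain NumPy, no course-provided code): the training set grows, but the CV and test sets shrink, so their error estimates become noisier.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
indices = rng.permutation(n)  # shuffle once, then slice into the three sets


def split(indices, train_frac, cv_frac):
    """Split shuffled indices into train/CV/test by the given fractions."""
    n = len(indices)
    n_train = int(n * train_frac)
    n_cv = int(n * cv_frac)
    return (indices[:n_train],
            indices[n_train:n_train + n_cv],
            indices[n_train + n_cv:])


train, cv, test = split(indices, 0.60, 0.20)
print(len(train), len(cv), len(test))  # 600 200 200

train, cv, test = split(indices, 0.80, 0.10)
print(len(train), len(cv), len(test))  # 800 100 100
```

So the 80-10-10 split hands the model 200 extra training examples, but the CV set it uses to detect overfitting is now half the size.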

would this be cheating? or is it a good way to address the problem?

one issue I see with this approach is the representative power of the CV and training data sets for the real world. After all, we use the CV and training splits to build a model that represents real-world data, not just the training data.

so maybe the new split resolves the overfitting in theory, but in practice my model still wouldn't be a good one for the real world, since the training data sample is not representative.

huh.

how do we know that even the overall data set is big enough to be representative of real-world scenarios?

maybe the data at hand is full of samples with specific features, such as houses from a very high-income neighbourhood. a 'just right' model fit to this data would not be just right for the real world, even if the CV and training errors are well below the baseline scenarios.
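A toy sketch of that failure mode, under made-up assumptions (a hypothetical square-root price curve, and a biased sample containing only large houses): the linear fit looks fine on the biased data but degrades once you evaluate it on the full market.

```python
import numpy as np

rng = np.random.default_rng(1)

def true_price(size):
    """Hypothetical nonlinear ground-truth price curve (made up for illustration)."""
    return 50 + 30 * np.sqrt(size)

# Biased sample: only large houses, as in a high-income neighbourhood
size_biased = rng.uniform(8, 10, 200)
price_biased = true_price(size_biased) + rng.normal(0, 1, 200)

# Fit a straight line; over the narrow size range it looks great...
coeffs = np.polyfit(size_biased, price_biased, 1)
pred_biased = np.polyval(coeffs, size_biased)
rmse_biased = np.sqrt(np.mean((pred_biased - price_biased) ** 2))

# ...but on the full market (sizes 1-10) the same line extrapolates badly
size_all = rng.uniform(1, 10, 200)
price_all = true_price(size_all) + rng.normal(0, 1, 200)
pred_all = np.polyval(coeffs, size_all)
rmse_all = np.sqrt(np.mean((pred_all - price_all) ** 2))

print(rmse_biased, rmse_all)  # the second error is several times the first
```

Both training and CV error would look "just right" here, because both splits come from the same biased sample; the problem only shows up against data the splits never saw.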

I guess this is an issue that Andrew doesn't cover as of Course 2, Week 3. Maybe he will in later courses… will he?

anyway… I wonder what you all think about this discussion.

warmly, Mehmet