C2W3_Lab_01_Model_Evaluation_and_Selection: SGDRegressor vs. LinearRegression and ratio train/CV MSE

  1. Why not use SGDRegressor, or another estimator with regularization, instead of LinearRegression once we have identified the best polynomial order? I’m asking because one can use regularization to ensure we are not overfitting the data.

  2. What is an acceptable MSE ratio between train and CV?

Hello @francktchafa,

For your first question: if we wanted to apply regularization, we should have done it while identifying the best set of hyperparameters (including the polynomial order), not after. We would have treated the regularization strength as a tunable parameter, just like the polynomial order, and tested different parameter configurations. For the purpose of demonstrating hyperparameter tuning (or model selection), I think it is sufficient for the lab to tune only the polynomial order. Considering regularization is a good idea, though, and it is probably more useful when we have many features, or, even more so, when we are modeling with a non-linear, more flexible, multi-layer neural network whose architecture must be predefined with insufficient prior information. Here in the lab we have only one feature, and we are demonstrating model selection by including some bad polynomial order choices that are obviously going to overfit, which is perhaps why we don’t want to counteract that by applying regularization. That is just for the purpose of demonstration, however; in practice, we should keep all options on the table.
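To make the idea concrete, here is a minimal sketch (not the lab’s code; the synthetic dataset, variable names, and the choice of Ridge are illustrative assumptions) of treating both the polynomial order and the regularization strength as hyperparameters and selecting the pair with the lowest CV MSE:

```python
# Sketch: jointly tune polynomial degree and regularization strength (alpha),
# selecting by cross-validation MSE. Data here is a synthetic noisy quadratic.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
x = rng.uniform(0, 4, 120).reshape(-1, 1)
y = x.ravel() ** 2 + rng.normal(0, 1.0, 120)   # noisy quadratic target

x_train, x_cv, y_train, y_cv = train_test_split(x, y, test_size=0.4, random_state=0)

best = None
for degree in range(1, 11):                     # candidate polynomial orders
    for alpha in [0.01, 0.1, 1.0, 10.0]:        # candidate regularization strengths
        poly = PolynomialFeatures(degree, include_bias=False)
        scaler = StandardScaler()
        X_train = scaler.fit_transform(poly.fit_transform(x_train))
        X_cv = scaler.transform(poly.transform(x_cv))
        model = Ridge(alpha=alpha).fit(X_train, y_train)
        cv_mse = mean_squared_error(y_cv, model.predict(X_cv))
        if best is None or cv_mse < best[0]:
            best = (cv_mse, degree, alpha)

print(f"best CV MSE={best[0]:.3f} at degree={best[1]}, alpha={best[2]}")
```

The point is simply that regularization enters the same search loop as the polynomial order, rather than being bolted on after the order has been chosen.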

For your second question, I have never heard of an acceptable range. The best we can hope for is a ratio close to 1, but a ratio close to 1 is not necessarily good. A CV MSE much larger than the train MSE is an indication of overfitting, so in that context the ratio itself matters. However, even if the two are very similar, MSEs that are too large can be an indication of underfitting. I recommend reviewing all the lecture videos covering “Bias and Variance” from our MLS, or from DLS Course 3 (which should be available for free in audit mode or on YouTube, and which shouldn’t require knowledge from Course 1 or 2).
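A quick illustration of both failure modes (the data and degree choices below are my own, not from the lab): a degree-1 fit to quadratic data underfits, so train and CV MSE are both large even though their ratio is modest, while a degree-15 fit overfits, so the train MSE collapses and the CV/train ratio blows up.

```python
# Sketch: compare train vs. CV MSE for an underfit (degree 1) and an
# overfit (degree 15) polynomial model on the same noisy quadratic data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
x = rng.uniform(0, 4, 60).reshape(-1, 1)
y = x.ravel() ** 2 + rng.normal(0, 1.0, 60)

x_tr, x_cv, y_tr, y_cv = train_test_split(x, y, test_size=0.4, random_state=1)

results = {}
for degree in (1, 15):
    poly = PolynomialFeatures(degree)
    Xtr, Xcv = poly.fit_transform(x_tr), poly.transform(x_cv)
    model = LinearRegression().fit(Xtr, y_tr)
    tr = mean_squared_error(y_tr, model.predict(Xtr))
    cv = mean_squared_error(y_cv, model.predict(Xcv))
    results[degree] = (tr, cv)
    print(f"degree={degree:2d}  train MSE={tr:.2f}  CV MSE={cv:.2f}  ratio={cv / tr:.1f}")
```

Neither number alone is “acceptable” or not; it is the combination (both large, or CV much larger than train) that carries the diagnosis.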


What do you mean by too large? Any value?

No, no particular value. The idea is that there is no hard number telling you when it is good or bad; it is problem dependent. I suggest watching the Course 2 Week 3 lectures again, because they give you the idea.

Don’t look for a value, because there is no such thing as a generally accepted MSE for every dataset. Instead, watch the lectures for the idea of how to diagnose bias and variance problems.

Hello @francktchafa

The discussion of Bias and Variance offered by the course is not very maths oriented, but as you read more on the internet, you will probably come across the following bias-variance decomposition of the expected MSE,

\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \mathrm{Bias}\big[\hat{f}(x)\big]^2 + \mathrm{Var}\big[\hat{f}(x)\big] + \sigma^2

(source, with the steps to derive it)

I am not going into the details of how it is derived, nor the meaning of every symbol or the equation itself, but if you can just accept the formula: there is a \sigma term there, usually referred to as the “irreducible error”, and that error can be introduced by the process of collecting the data. For example, suppose we are collecting image data of road signs in the streets, but (1) we take photos from a moving car, (2) the road is bumpy, and (3) we are driving too fast, so the images are not always clear. Then the bumpier the road and the faster we drive, the more likely the images are to be blurred, and the higher that irreducible error will be.

It is as simple as asking ourselves, “how can we expect any model trained on blurred images to recognize the signs well?”. We can’t, because there is irreducible error in our training dataset that prevents the model from seeing what the road signs actually look like.

Now, the irreducible error is part of that MSE (though it is neither the bias nor the variance). If the irreducible error is large due to the poor condition of the data, then the acceptable MSE should be larger accordingly, agree? If the photos are super clear (which implies a lower irreducible error), then the acceptable MSE should be lower too. This is a qualitative example of why there can’t be a generally acceptable value: it depends on your training dataset.
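We can see this floor numerically with a small synthetic experiment (my own illustration, not the lab’s): even a model of exactly the right functional form cannot push the CV MSE below the noise variance \sigma^2, so “clear” data (small \sigma) and “blurry” data (large \sigma) end up with very different achievable MSEs.

```python
# Sketch: fit the *correct* model family (degree-2 polynomial) to quadratic
# data at two noise levels; the CV MSE settles near sigma^2 in each case,
# showing the irreducible error sets the floor for any "acceptable" MSE.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
x = rng.uniform(0, 4, 2000).reshape(-1, 1)

results = {}
for sigma in (0.2, 2.0):                        # "clear" vs "blurry" data
    y = x.ravel() ** 2 + rng.normal(0, sigma, len(x))
    x_tr, x_cv, y_tr, y_cv = train_test_split(x, y, test_size=0.5, random_state=2)
    poly = PolynomialFeatures(2)                # the true functional form
    model = LinearRegression().fit(poly.fit_transform(x_tr), y_tr)
    results[sigma] = mean_squared_error(y_cv, model.predict(poly.transform(x_cv)))
    print(f"sigma={sigma}: CV MSE={results[sigma]:.3f}  (sigma^2={sigma ** 2:.2f})")
```

The same model, judged by the same metric, lands at roughly 0.04 on one dataset and roughly 4 on the other, purely because of the data collection noise.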


My last reply focused on the irreducible error, while your last reply quoted my earlier comment on high bias but asked about a standard MSE value. My point is: as long as our dataset is not perfect (and there is no perfect real-world dataset), and we know that the irreducible error varies from dataset to dataset, it is pointless to discuss any absolute standard MSE value that would be valid for every dataset.

For when the bias is too large, I would just refer you back to the lectures as they have discussed that.

Thanks for this clarification!