C2_W3 Model selection and training/cross validation/test sets


In week 3, in the video about “Model selection and training/cross validation/test sets”, Professor Ng says the following:

Finally, if you want to report out an estimate of the generalization error of how
well this model will do on new data. You will do so using that third subset of your data,
the test set and you report out Jtest of w4,b4. You notice that throughout this entire procedure, you had fit these parameters using the training set. You then chose the parameter d or chose the degree of polynomial using the cross-validation set and so up until this point, you’ve not fit any parameters, either w or b or d to the test set and that’s why Jtest in this example will be fair estimate of the generalization error of this model thus parameters w4,b4. This gives a better procedure for model selection and it lets you automatically make a decision like what order polynomial to choose for your linear regression model. This model selection procedure also works for choosing among other types of models. For example, choosing a neural network architecture. If you are fitting a model for handwritten digit recognition, you might consider three models like this, maybe even a larger set of models than just me but here are a few different neural networks of small, somewhat larger, and then even larger. To help you decide how many layers do the neural network have and how many hidden units per layer should you have, you can then train all three of these models and end up with parameters w1, b1 for the first model, w2, b2 for the second model, and w3,b3 for the third model. You can then evaluate the neural networks performance using Jcv, using your cross-validation set. Since this is a classification problem, Jcv the most common choice would be to compute this as the fraction of cross-validation examples that the algorithm has misclassified. You would compute this using all three models and then pick the model with the lowest cross validation error. If in this example, this has the lowest cross validation error, you will then pick the second neural network and use parameters trained on this model and finally, if you want to report out an estimate of the generalization error, you then use the test set to estimate how well the neural network that you just chose will do. 
It’s considered best practice in machine learning that if you have to make decisions about your model, such as fitting parameters or choosing the model architecture (such as a neural network architecture, or the degree of the polynomial if you’re fitting a linear regression), you make all those decisions using only your training set and your cross-validation set, and you do not look at the test set at all while you’re still making decisions regarding your learning algorithm. It’s only after you’ve come up with one model as your final model that you evaluate it on the test set, and because you haven’t made any decisions using the test set, that ensures that your test set is a fair and not overly optimistic estimate of how well your model will generalize to new data.
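The procedure quoted above can be sketched in code. This is a minimal illustration with synthetic data; the use of scikit-learn, the dataset, and the range of degrees are my own assumptions, not from the course:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic regression data (hypothetical example)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 3 - X[:, 0] + rng.normal(0, 2, size=200)

# 60/20/20 split: training, cross-validation, and test sets
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_cv, X_test, y_cv, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Fit w, b on the training set for each candidate degree d,
# and choose d using the cross-validation error J_cv
best_d, best_jcv, best_model, best_poly = None, np.inf, None, None
for d in range(1, 11):
    poly = PolynomialFeatures(degree=d, include_bias=False)
    model = LinearRegression().fit(poly.fit_transform(X_train), y_train)
    jcv = mean_squared_error(y_cv, model.predict(poly.transform(X_cv)))
    if jcv < best_jcv:
        best_d, best_jcv, best_model, best_poly = d, jcv, model, poly

# The test set is touched exactly once, after d has been chosen,
# so J_test is a fair estimate of the generalization error
jtest = mean_squared_error(y_test, best_model.predict(best_poly.transform(X_test)))
print(f"chosen degree d={best_d}, J_cv={best_jcv:.2f}, J_test={jtest:.2f}")
```

The key point the code makes concrete: the loop never looks at `X_test`; the test set appears only on the last line, after every decision has been made.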

My question is: does this also apply when comparing completely different types of learning models, such as comparing between a neural network and a random forest, for example?

If that’s the case, I’d appreciate it if someone could explain to me why that is. The way I see it, I’d use the training set to fit the neural network and the random forest, the cross-validation set to tune the hyper-parameters, architecture, etc., of each model, and then I would compare their performance based on the test set. I wouldn’t be using the test set to make any decisions regarding either type of model; I would only use it to compare their performances.

Hi @Andromeda18,

As you quoted,

it’s the practice to use the training set and the cv set to make all decisions including hyper-parameter tuning. I have only one comment regarding this:

If the purpose of the performance comparison is to decide which model does better and to adopt one of them, that is also a kind of decision-making and should therefore be made on the cv set. Of course, if you are just comparing them and not picking the best one, it’s fine.


You’re right that choosing a model is a kind of decision-making, but it seems to me it’s not exactly the same thing as choosing the degree of a polynomial, for example. When you’re tuning a model’s hyper-parameters, you’re using the CV data to try different values for the hyper-parameters, and because of that the model’s performance on the CV data will be biased. However, when comparing the performance of different models, you’re not using any data to learn or pick any parameter. As such, doesn’t it make sense to use the unbiased estimates of the models’ performances (i.e., Jtest) to compare them?

I see your point; however, picking a decision tree model over a linear regression model, for example, is not so different from picking a linear regression model of degree 3 over one of degree 2, because a decision tree, a linear model of degree 3, and a linear model of degree 2 are just three different model assumptions about the training data.

No one would hesitate to say that the degree value is a tunable hyper-parameter, but in the bigger picture, the choice of model assumption is itself a hyper-parameter. You might consider this hyper-parameter to be F, where F is a functional form rather than a value, so that you are assuming y = F_w(x) for your training data, where the additional w represents the collection of all trainable model parameters included in that functional form. Now you can tune this hyper-parameter F to be a decision tree model assumption, a 2nd-order linear model, or a 3rd-order linear model.
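To make the idea concrete, here is a rough sketch of treating the functional form F as a hyper-parameter tuned on the cv set. The candidate models, the synthetic data, and the use of scikit-learn are all my own illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (hypothetical example)
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=300)
X_train, y_train = X[:180], y[:180]
X_cv, y_cv = X[180:240], y[180:240]
X_test, y_test = X[240:], y[240:]

# Each candidate F is a different functional-form assumption y = F_w(x)
candidates = {
    "linear_d2": make_pipeline(PolynomialFeatures(2), LinearRegression()),
    "linear_d3": make_pipeline(PolynomialFeatures(3), LinearRegression()),
    "tree_depth4": DecisionTreeRegressor(max_depth=4, random_state=0),
}

# Tune F on the cross-validation set, exactly like any other hyper-parameter
jcv = {name: mean_squared_error(y_cv, m.fit(X_train, y_train).predict(X_cv))
       for name, m in candidates.items()}
best_name = min(jcv, key=jcv.get)

# The test set is used once, only for the final chosen model
jtest = mean_squared_error(y_test, candidates[best_name].predict(X_test))
print(f"chosen F: {best_name}, J_cv={jcv[best_name]:.3f}, J_test={jtest:.3f}")
```

Here the decision tree and the two polynomial models are treated symmetrically: all three are just values of F, and the choice among them happens on the cv set, never on the test set.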

Does this make sense to you?


Yeah, actually it does. I’ve been thinking about this a lot, and sometimes I still have some doubts, but you’re right when you say that choosing between model types based on the test data is no different from choosing the order of a polynomial. I find that it helps me to think about this in terms of having a model in production. When a model is deployed, that model is the result of all my decisions, including deciding between model types, and the data the model sees after being deployed wasn’t used in any stage of the learning process. The test set should simulate precisely that. Thanks for your help!

Very well summarized @Andromeda18 :wink:

Hi @shanup,

Both you and Raymond are definitely right about this, although I have seen many research papers where the test set is used to choose between different types of models. I understand now why that’s not the right approach. Thanks for your help.

You are most welcome @Andromeda18

Hi Raymond,

I have two questions regarding this slide from the course:

1) Why is ‘d’ deemed an extra parameter? In the chosen model itself (where d=5), the parameters still seem to be w1, w2, …, w5 and b; d is not included in this model.
2) Why would introducing an extra parameter d reduce J_test?

Lastly, you alluded to the term “hyper-parameter” in your previous answers. What does hyper-parameter mean, and how is it different from the model parameters W?

As always, many thanks for your help.
Christina Fan

Hello @Christina_Fan,

d is a hyper-parameter, which is a parameter that you can’t train with gradient descent; you need to select a value for it manually. I suggest this for a more complete discussion.

In this slide, d=5 implies more trainable weights than d=1, and more trainable weights give the model more freedom to fit the training set well. If it does not fit “too well”, meaning it does not overfit, then you can expect a better J_{test}.
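You can see the weight count directly in code. This is a small sketch of my own (using scikit-learn, which the course itself does not use) just to show where the extra trainable weights come from:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Toy 1-D data (hypothetical example)
X = np.linspace(-2, 2, 50).reshape(-1, 1)
y = X[:, 0] ** 3 - X[:, 0]

for d in (1, 5):
    features = PolynomialFeatures(degree=d, include_bias=False).fit_transform(X)
    model = LinearRegression().fit(features, y)
    # d=1 gives one weight w1; d=5 gives five weights w1..w5 (plus b in both cases)
    print(f"d={d}: {model.coef_.size} trainable weights w, 1 bias b")
```

d itself never appears inside the fitted model; it only controls how many w’s the model has, which is exactly what makes it a hyper-parameter rather than a trainable parameter.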



Brilliant, now I get it :smiling_face: Thank you Raymond.


Sure, @Christina_Fan!