I was watching the video “Model selection and training/cross-validation/test sets” when this doubt arose.
In this video, Andrew says that in order to evaluate a model we should use an extra cross-validation set (or dev-set), and then choose the model which minimizes the error for this set. However, after that, he says that we should measure the generalization error with the test-set.
My question is the following:
- What happens if we find that the model that minimizes the error on the dev-set is different from the model that minimizes the error on the test-set? Which model should we choose in that case?
I understand that we have to use the dev-set to choose the model in order to avoid a bias (because one model may simply happen to predict the test-set values better than another), but in my opinion the same bias appears here, since one model could happen to predict the dev-set values better than another. For me, the best solution to this problem is to use K-fold cross-validation and evaluate the models using the average of the errors obtained over the folds.
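To make the K-fold idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration only: the synthetic data, the two toy candidate models (a constant predictor and a straight line), and the helper names `fit_constant`, `fit_line` and `k_fold_error` are not from the course.

```python
import random

def fit_constant(xs, ys):
    # Toy model 1: always predict the mean of the training targets.
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    # Toy model 2: least-squares straight line (closed form).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def k_fold_error(fit, xs, ys, k=5, seed=0):
    # Shuffle the indices once, split them into k folds, and average
    # the held-out error over the k train/evaluate rounds.
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors.append(mse(model,
                          [xs[i] for i in held_out],
                          [ys[i] for i in held_out]))
    return sum(errors) / k

# Synthetic data (assumed): a noisy straight line.
rng = random.Random(42)
xs = [i / 10 for i in range(50)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 0.5) for x in xs]

candidates = {"constant": fit_constant, "line": fit_line}
cv_errors = {name: k_fold_error(fit, xs, ys) for name, fit in candidates.items()}
best = min(cv_errors, key=cv_errors.get)
print(cv_errors, "-> chosen:", best)
```

Note that each candidate is trained k times, which is where the extra cost of this approach comes from.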
What do you think? Can anybody help me?
Thanks a lot!
We should point out that all these sets come from the same distribution, so if the model performs well on one part of the dataset but poorly on other parts, it is not a good model after all.
K-fold cross-validation is definitely a good technique, but it is mostly applicable to small datasets. On large datasets it is computationally very inefficient to repeat the training process many times, which is why he suggests holding out a dev-set that is drawn from, and representative of, the same distribution to evaluate the model.
Ok. I understand that K-fold cross-validation may not be feasible.
However, as you pointed out, if the model performs badly on a part of the dataset, it doesn’t seem to be a good model. So what I’m asking myself now is: why should we separate a dev-set from the test-set? Having a larger test-set to evaluate and choose the models should be better, shouldn’t it?
The dev-set is the set on which you tune the hyperparameters of your model so that it fits the training and dev-sets better. Consequently, you can say that the model also learns from the dev-set in some way. The test set is completely unseen by the model, so at this point we see how it performs “outside laboratory conditions”.
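As a rough sketch of that workflow, the snippet below tunes one hyperparameter on the dev-set and touches the test set only once, at the end. The model (1-D k-nearest-neighbours regression), the synthetic data, the 60/20/20 split and the candidate values of `k` are all assumptions chosen for illustration, not something taken from the video.

```python
import random

def knn_predict(train, x, k):
    # Average the targets of the k training points closest to x.
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, points, k):
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in points) / len(points)

# Synthetic data (assumed): noisy samples of y = x^2.
rng = random.Random(0)
data = []
for _ in range(300):
    x = rng.uniform(-3, 3)
    data.append((x, x ** 2 + rng.gauss(0, 1.0)))

# 60% train, 20% dev, 20% test.
train, dev, test = data[:180], data[180:240], data[240:]

# Step 1: use the dev-set to choose the hyperparameter k.
candidate_ks = [1, 5, 15, 50]
dev_errors = {k: mse(train, dev, k) for k in candidate_ks}
best_k = min(dev_errors, key=dev_errors.get)

# Step 2: report the generalization error on the untouched test set.
test_error = mse(train, test, best_k)
print("dev errors:", dev_errors)
print("chosen k:", best_k, "test error:", test_error)
```

Because `best_k` was picked to minimize the dev error, that dev error tends to be an optimistic estimate; the test error is the number to report, and it should not be used to go back and pick a different `k`.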
By this logic, why not extend the model to use more than one dev set? Say two or three or ten. Was it proven that one dev set is optimal? Curious.
In my opinion, first, data is a scarce resource, and we are only willing to give away a certain portion to build a cv set.
Second, further dividing the cv set into 10 sub-cv-sets won’t give us a different outcome if we take the weighted average (weighted by subset size) of the 10 subsets’ metric results, provided the metric is evaluated by adding up (or taking the mean of) the losses of individual samples.
If you have time, you can define a metric (e.g. mean squared error) and make 20 pseudo cv set samples, giving each of them a true value and a predicted value. Calculate the metric value A for the 20 samples as a whole. Then divide the samples into 4 equal subsets, calculate a metric value for each subset, and take the average of the 4 values to get measurement B. Then you may compare A with B.
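The experiment can be sketched in a few lines of Python. The random "true" and "predicted" values are made up, and the names `A`, `B`, `B_weighted` and `B_plain` are just labels for this illustration. The last part also tries subsets of unequal size, where the average has to be weighted by subset size to match A.

```python
import math
import random

def mse(pairs):
    # Mean squared error over (true value, predicted value) pairs.
    return sum((t - p) ** 2 for t, p in pairs) / len(pairs)

# 20 pseudo cv samples with made-up true and predicted values.
rng = random.Random(0)
samples = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(20)]

# A: the metric on all 20 samples at once.
A = mse(samples)

# B: split into 4 subsets of 5 samples each, then average the 4 metrics.
subsets = [samples[i:i + 5] for i in range(0, 20, 5)]
B = sum(mse(s) for s in subsets) / len(subsets)

# Unequal subsets (sizes 2, 7, 11): weight each metric by its subset size.
uneven = [samples[:2], samples[2:9], samples[9:]]
B_weighted = sum(len(s) * mse(s) for s in uneven) / len(samples)
B_plain = sum(mse(s) for s in uneven) / len(uneven)  # unweighted, for comparison

print("A          =", A)
print("B          =", B)
print("B_weighted =", B_weighted)
print("B_plain    =", B_plain)
print("A equals B (up to rounding):", math.isclose(A, B))
```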
Thanks Raymond. I guess you are referring to the Central Limit Theorem in your suggestion?
We don’t need the CLT here, because algebra is enough to derive that A and B will be the same, not to mention that we are discussing only 4 subsets here.
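For completeness, the algebra can be written out as follows. The notation is introduced here only for this derivation: $\ell_i$ is the loss of sample $i$, $N$ is the total number of samples, and the cv set is split into $K$ disjoint subsets $S_1, \dots, S_K$ of sizes $n_1, \dots, n_K$ with per-subset metric $M_j$.

```latex
\begin{align*}
M_j &= \frac{1}{n_j} \sum_{i \in S_j} \ell_i, \qquad \sum_{j=1}^{K} n_j = N \\
A &= \frac{1}{N} \sum_{i=1}^{N} \ell_i
   = \frac{1}{N} \sum_{j=1}^{K} \sum_{i \in S_j} \ell_i
   = \sum_{j=1}^{K} \frac{n_j}{N} \, M_j \\
\text{if } n_j = \frac{N}{K} \text{ for all } j:\quad
A &= \frac{1}{K} \sum_{j=1}^{K} M_j = B
\end{align*}
```

So A is always the size-weighted average of the subset metrics, and it coincides with the plain average B when the subsets are of equal size, as in the 20-samples-into-4 example (N = 20, K = 4, n_j = 5).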