My understanding is that overfitting means a model performs well on the data it has seen, but poorly on unseen data.
However, in practice, during model training we use a validation set for model selection (for example, tuning the learning rate, model size, number of epochs, etc.).
Doesn’t that mean the validation set has actually been used to adjust the model?
If so, can it still be considered “unseen” data?
In Andrew Ng’s lecture slides, he uses the comparison between training error and validation error to diagnose training overfitting (e.g., low training error but high validation error).
Doesn’t that imply that the validation set is considered unseen data?
But if the validation set has already been involved in model selection, can it truly be regarded as unseen?
So I’m confused:
Is the validation set considered “seen” or “unseen”?
How should we think about this distinction in theory versus in practice?
However, Andrew Ng describes overfitting as a situation where a model performs very well on the training set, but poorly on the validation set.
This seems to imply that the validation set is also considered “unseen data.”
I am still confused. Why is the validation set treated as unseen data, even though it is used for model selection and hyperparameter tuning?
This is a beginner-level course. I don’t think Andrew has introduced the concept of the test set yet. It comes later.
You are correct that the validation set affects the training results. It is used to adjust the hyperparameters, but the validation examples are not directly involved in the training process that sets the weight values. So the validation set is partly “unseen” data and partly “seen” data.
The complete answer is that the truly “unseen data” is the test set.
Right! Maybe we could say the same thing with slightly different wording and see if it helps at all:
The point is that we don’t actually train the model on the validation dataset: we only use the validation data to check (“validate”) the performance of the trained model and decide whether it is bad enough that we need to train again with different hyperparameters. So the validation data does influence the final result, but training never computes gradients from it. We only use it in “inference” mode to calculate prediction accuracy on that data.
Overfitting as a general term just means that the model gives better results (prediction accuracy) on the actual training data than on some other data on which it was not trained (meaning gradient descent was not run using that other data).
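In case a concrete version helps, here is a minimal sketch of that workflow in plain Python. Everything in it is an assumption made up for illustration, not anything from the course: the toy data (y = 3x + 1 plus noise), the helper names make_data, mse and train, and the three candidate learning rates. The thing to notice is that train() only ever touches the training set, while the validation set is only passed to mse(), which makes predictions and never updates the weights.

```python
import random

def make_data(n, rng):
    # Hypothetical toy data: y = 3x + 1 plus Gaussian noise.
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, 3 * x + 1 + rng.gauss(0, 0.1)) for x in xs]

def mse(w, b, data):
    # Inference only: compute predictions and error, no gradients, no updates.
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def train(train_set, lr, epochs=200):
    # Batch gradient descent; gradients come from train_set only.
    w, b = 0.0, 0.0
    n = len(train_set)
    for _ in range(epochs):
        gw = sum(2 * (w * x + b - y) * x for x, y in train_set) / n
        gb = sum(2 * (w * x + b - y) for x, y in train_set) / n
        w -= lr * gw
        b -= lr * gb
    return w, b

rng = random.Random(0)
train_set = make_data(60, rng)
val_set = make_data(20, rng)
test_set = make_data(20, rng)

# Model selection: train once per hyperparameter value, score on validation.
results = {}
for lr in [0.001, 0.01, 0.1]:
    w, b = train(train_set, lr)
    results[lr] = (mse(w, b, val_set), w, b)

# The validation set influences the outcome here, but only through this choice.
best_lr = min(results, key=lambda lr: results[lr][0])
_, best_w, best_b = results[best_lr]

# The test set is used once, after all choices are made.
test_mse = mse(best_w, best_b, test_set)
print("chosen learning rate:", best_lr)
print("test MSE:", test_mse)
```

So the validation data is "seen" by the outer selection loop but "unseen" by gradient descent, and the test data is unseen by both.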
Thanks for your reply,
But now, suppose we have a test set. If the model performs well on both the training and validation sets, but performs poorly on the test set, does that mean we have overfitted to the validation set?
In this situation, neither the validation set nor the test set was used to compute gradients (i.e., gradient descent was not run on either of them).
Or do you mean that when you say “overfitting” here, you are specifically referring to training overfitting, that is, the situation where the model performs much better on the training set than on data that was not used for gradient descent?
In other words, you are not referring to other forms such as overfitting to the validation set, or the test set?
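For what it's worth, one possible reason for that pattern is that picking the best of many candidates by validation score can inflate the validation score by itself, even though no gradients are ever computed from the validation set. Here is a toy sketch of that effect, under made-up assumptions: the labels are coin flips, and each "candidate" is a hypothetical stand-in for one hyperparameter setting whose predictions are pure random guesses. No candidate can truly do better than 50%, yet the one chosen on validation looks much better there than on the test set. A mismatch between the validation and test distributions could produce a similar gap, so this is only one explanation.

```python
import random

rng = random.Random(0)

def accuracy(preds, labels):
    # Fraction of predictions that match the labels.
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Labels are coin flips, so no model can truly beat 50% accuracy.
val_labels = [rng.randint(0, 1) for _ in range(50)]
test_labels = [rng.randint(0, 1) for _ in range(1000)]

# Each hypothetical candidate stands in for one hyperparameter setting;
# its predictions are random guesses on both sets.
candidates = []
for _ in range(200):
    val_preds = [rng.randint(0, 1) for _ in range(50)]
    test_preds = [rng.randint(0, 1) for _ in range(1000)]
    candidates.append((val_preds, test_preds))

# Model selection looks only at validation accuracy.
best_val_preds, best_test_preds = max(
    candidates, key=lambda c: accuracy(c[0], val_labels)
)

best_val_acc = accuracy(best_val_preds, val_labels)
test_acc = accuracy(best_test_preds, test_labels)
print("validation accuracy of the selected candidate:", best_val_acc)
print("test accuracy of the same candidate:", test_acc)
```

The selected candidate's validation accuracy is optimistic because it was chosen for being the luckiest on those 50 examples. The test set played no part in the choice, so it gives the honest estimate.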
As Tom mentioned earlier in this thread, this is a beginning series of courses. Rather than spending more mental energy debating the meaning of the terminology here, my suggestion would be to finish the courses in MLS and then take DLS. The issues around overfitting and underfitting and various strategies for dealing with those situations will be discussed in more detail in DLS C2 and C3 and beyond.