@tbhaxor et al
I think folks on this thread are in violent agreement about this topic, but I saw a related question on StackExchange today that prompted me to go read a pretty famous textbook I have a copy of (see the link at the bottom).
Here is some great stuff from Chapter 7, Model Assessment and Selection…
The generalization performance of a learning method relates to its prediction capability on independent test data. Assessment of this performance is extremely important in practice, since it guides the choice of learning method or model, and gives us a measure of the quality of the ultimately chosen model.
Training error is the average loss over the training sample:
\displaystyle err = \frac{1}{N}\sum_{i=1}^{N}L(y_i,\hat{f}(x_i))
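(For anyone who likes to see it in code, here is a minimal sketch of that formula, assuming squared-error loss; the toy numbers are made up purely for illustration.)

```python
# Quick numpy check of the training-error formula above, assuming
# squared-error loss L(y, yhat) = (y - yhat)**2. The toy arrays below
# are made up purely for illustration.
import numpy as np

def training_error(y, y_hat):
    """Average loss over the training sample: (1/N) * sum_i L(y_i, f_hat(x_i))."""
    return np.mean((y - y_hat) ** 2)

y     = np.array([1.0, 2.0, 3.0])   # observed targets y_i
y_hat = np.array([1.1, 1.9, 3.2])   # model predictions f_hat(x_i)
print(training_error(y, y_hat))     # 0.02
```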
We would like to know the expected test error of our estimated model \hat{f}. [NOTE: these classes typically use \hat{y} instead of \hat{f}.] As the model becomes more and more complex, it uses the training data more and is able to adapt to more complicated underlying structures. Hence there is a decrease in bias but an increase in variance. There is some intermediate model complexity that gives minimum expected test error.
Typically our model will have a tuning parameter or parameters \alpha and so we can write our predictions as \hat{f}_\alpha (x). The tuning parameter varies the complexity of our model, and we wish to find the value of \alpha that minimizes error, that is, produces the minimum of the average test error.
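Here is a hedged little experiment to make that concrete, where I take the tuning parameter \alpha to be the polynomial degree (my choice for illustration, not something from the book): the training error keeps shrinking as the degree grows, while the error on held-out data typically bottoms out at some intermediate complexity.

```python
# Hedged illustration of a tuning parameter controlling complexity.
# Here alpha is taken to be the polynomial degree (an assumption for
# illustration), and the data are synthetic. Training error keeps
# shrinking as the degree grows, while error on held-out data
# typically bottoms out at some intermediate degree.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=60)
y = np.sin(3 * x) + rng.normal(scale=0.2, size=60)
x_train, y_train = x[:40], y[:40]
x_hold,  y_hold  = x[40:], y[40:]

for degree in [1, 3, 5, 9]:
    coeffs = np.polyfit(x_train, y_train, degree)            # fit f_hat_alpha
    train_err = np.mean((y_train - np.polyval(coeffs, x_train)) ** 2)
    hold_err  = np.mean((y_hold  - np.polyval(coeffs, x_hold)) ** 2)
    print(f"degree={degree}  train MSE={train_err:.3f}  held-out MSE={hold_err:.3f}")
```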
It is important to note that there are in fact two separate goals that we might have in mind:
- Model selection: estimating the performance of different models in order to choose the best one.
- Model assessment: having chosen a final model, estimating its prediction error (generalization error) on new data.
If we are in a data-rich situation, the best approach for both problems is to randomly divide the dataset into three parts: a training set, a validation set, and a test set. The training set is used to fit the models; the validation set is used to estimate prediction error for model selection; the test set is used for assessment of the generalization error of the final chosen model. Ideally, the test set should be kept in a “vault,” and be brought out only at the end of the data analysis. Suppose instead that we use the test-set repeatedly, choosing the model with smallest test-set error. Then the test set error of the final chosen model will underestimate the true test error, sometimes substantially.
It is difficult to give a general rule on how to choose the number of observations in each of the three parts, as this depends on the signal-to-noise ratio in the data and the training sample size. A typical split might be 50% for training, and 25% each for validation and testing.
Yeah, that.
Models are trained on the training set. Then the validation set is used to perform model selection, estimating the prediction error of the models trained with differing \alpha. Without a validation set, one might select a model based only on minimizing training error, which leads to overfitting, i.e., high variance. Finally, the test set is used to perform model assessment, estimating the generalization error of the selected model.
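Here is a rough end-to-end sketch of that workflow, using the 50/25/25 split from the quoted passage. Ridge regression, the \alpha grid, and the synthetic data are my own assumptions purely for illustration; the point is that the validation set picks \alpha and the test set is touched exactly once at the end.

```python
# Rough end-to-end sketch of the train -> validate -> test workflow,
# using the 50/25/25 split from the quoted passage. Ridge regression,
# the alpha grid, and the synthetic data are assumptions made purely
# for illustration.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = X @ rng.normal(size=20) + rng.normal(scale=1.0, size=1000)

X_tr, y_tr = X[:500],    y[:500]      # 50% training: fit the models
X_va, y_va = X[500:750], y[500:750]   # 25% validation: model selection
X_te, y_te = X[750:],    y[750:]      # 25% test: model assessment, used once

# Model selection: keep the alpha with the smallest validation error.
best_alpha, best_err, best_model = None, np.inf, None
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X_tr, y_tr)
    err = mean_squared_error(y_va, model.predict(X_va))
    if err < best_err:
        best_alpha, best_err, best_model = alpha, err, model

# Model assessment: estimate the generalization error of the chosen
# model on the test set, which was kept "in the vault" until now.
test_err = mean_squared_error(y_te, best_model.predict(X_te))
print(f"alpha={best_alpha}  validation MSE={best_err:.3f}  test MSE={test_err:.3f}")
```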
Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition.