What is the reason behind having test set and dev set?

As far as I have understood from the material so far is that training set is used for estimating the optimal values for weights and biases, the dev set is used for evaluating generalization error for model selection, and the test set is used for evaluating generalization error on unseen data after training is complete.

The rationale being, if we are selecting the model based on the generalization error of the test set, we are already being biased by what model performs best on the test set and not on some unseen data.

So, selecting the model based on the generalization error of the dev set lets us understand what the actual generalization error is when calculated on the test set after training is already complete.

Now, say I select the model based on the dev set. I finish training my model on the training set. Then, after calculating the generalization error on the test set, I find that the model performs poorly.

I would guess that I now rethink my approach and then use some other approach afterwards. If that is the case, am I not still being influenced by the results on the test set?

If I do not change my approach after poor results on the test set, why use the test set in the first place?

That’s not exactly what we’re doing.

We’re using the test set as a spot-check on how the completed model performs. If it’s not acceptable, then we can go back to square-one and create an entirely new model.

Note that the sets are all randomly selected from the pool of labeled examples. So you can always try many random splits using the same model architecture, train and validate anew, and then look at a whole family of test set results (each will be statistically independent). This would give a better statistical grip on how well the finished model works.

What is the difference between choosing the model based on the results from the test set during training and choosing a new model from square-one based on the results from the test set after completed model?

We’re not directly modifying the model by using the test set.
We’re either accepting the model, or rejecting it and starting over.

1 Like