Why not evaluate models on the test set?

I understand the train set is used to learn W and b, then the dev set is used to evaluate and compare different models (find the most suitable hyperparameters). I don’t understand why we need to do this step on a separate dev data set. Why not just evaluate multiple models on the test data and pick the one that performs best? What is the purpose/benefit of the additional data set?

In a way, the model also learns from and fits the dev set: by adjusting hyperparameters, model structure, and data distribution based on dev-set results. The test set remains entirely unknown to the model. You could do as you suggest, but that is more of a "get lucky" approach than the trial, test, and further-learning approach that using a dev set gives you. In general, a dev set will improve performance much faster, and in a more controlled way, than working without one.


Hi, @djdevilliers.

This is a recurring question in this forum. As @gent.spah says, the model can also overfit the dev set.
Check this post for more details if you want.


Thank you for the info. I did read that thread and it didn’t answer my question.

What I’m trying to say is that ultimately the model would be “fit” to test data too. If I skip the dev set and just evaluate several models M(i) … M(j) against the test set then pick the best model M(k), in a way I have indirectly “fit” to the test data by picking the best performing model. If I then babysit that model and tune its hyperparameters I would further “fit” the model M(k) every time I change a hyperparameter that improves its performance, albeit indirectly. The model hasn’t directly seen the test data but some knowledge of its performance on the test data has leaked (through me) back into the hyperparameter choices.
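The selection bias described above can be demonstrated with a small simulation. This is a hypothetical sketch, not anything from the course materials: every "model" below has the same true accuracy (0.70), so any score differences on a finite evaluation set are pure noise. Picking the best of 50 such models by their score on that set yields an optimistically biased number, while a large fresh set reveals the true performance.

```python
import random


def evaluate(model_acc, n, rng):
    # fraction of n examples the model gets right,
    # each example being correct with probability model_acc
    return sum(rng.random() < model_acc for _ in range(n)) / n


rng = random.Random(0)
true_acc = 0.70
models = [true_acc] * 50  # 50 candidate models, identical true accuracy

# the "dev-free" workflow: pick the best model by its score on a
# 100-example evaluation set (playing the role of the test set)
scores = [evaluate(m, 100, rng) for m in models]
best = max(range(len(models)), key=lambda i: scores[i])

# the score used for selection is inflated by the max-over-noise...
selected_score = scores[best]
# ...while a large, fresh set shows the true accuracy of the winner
fresh_score = evaluate(models[best], 100_000, rng)

print(f"score used for selection: {selected_score:.2f}")
print(f"score on fresh data:      {fresh_score:.2f}")
```

The gap between the two printed numbers is exactly the "knowledge leaked through me" effect: selecting the winner by its score on a set makes that score an overestimate of performance on unseen data.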

Yes, you are right. The whole point of the third dataset is to check how the model would perform in production, but only as a final, informative metric. All decisions and hyperparameter tuning should be made with the dev set.
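That workflow can be sketched in a few lines. This is a minimal illustration with a made-up 1-D classification task and a single threshold standing in for the hyperparameters; none of these names come from the course:

```python
import random

rng = random.Random(42)


def make_split(n):
    # synthetic 1-D data: label is 1 when x > 0.5, with 10% label noise
    xs = [rng.random() for _ in range(n)]
    ys = [int(x > 0.5) if rng.random() > 0.1 else int(x <= 0.5) for x in xs]
    return xs, ys


train, dev, test = make_split(1000), make_split(200), make_split(200)


def accuracy(threshold, split):
    # the "model" predicts 1 when x exceeds the threshold
    xs, ys = split
    return sum(int(x > threshold) == y for x, y in zip(xs, ys)) / len(ys)


# model selection: compare candidate "hyperparameters" on the dev set ONLY
candidates = [i / 20 for i in range(1, 20)]
best_t = max(candidates, key=lambda t: accuracy(t, dev))

# the test set is touched exactly once, for a final unbiased estimate
print(f"chosen threshold: {best_t:.2f}")
print(f"test accuracy:    {accuracy(best_t, test):.2f}")
```

Because `test` played no role in choosing `best_t`, the final printed accuracy is an honest estimate of production performance, unlike a number obtained by comparing all the candidates on the test set itself.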