Why not evaluate models on the test set?

I understand the train set is used to learn W and b, then the dev set is used to evaluate and compare different models (find the most suitable hyperparameters). I don’t understand why we need to do this step on a separate dev data set. Why not just evaluate multiple models on the test data and pick the one that performs best? What is the purpose/benefit of the additional data set?

In a way, the model also learns from and fits the dev set: by adjusting hyperparameters, model structure, and data distribution based on dev-set results. The test set remains entirely unknown to the model. You could do as you suggest, but that is more of a "get lucky" approach than the trial, test, and further-learning approach that using a dev set gives you. In general, a dev set will improve performance much faster, and in a more controlled way, than working without one.


Hi, @djdevilliers.

This is a recurring question in this forum. As @gent.spah says, the model can also overfit the dev set.
Check this post for more details if you want.


Thank you for the info. I did read that thread and it didn’t answer my question.

What I’m trying to say is that ultimately the model would be “fit” to test data too. If I skip the dev set and just evaluate several models M(i) … M(j) against the test set then pick the best model M(k), in a way I have indirectly “fit” to the test data by picking the best performing model. If I then babysit that model and tune its hyperparameters I would further “fit” the model M(k) every time I change a hyperparameter that improves its performance, albeit indirectly. The model hasn’t directly seen the test data but some knowledge of its performance on the test data has leaked (through me) back into the hyperparameter choices.
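The selection bias described above can be demonstrated with a small simulation. This is a hypothetical sketch, not anything from the course materials: every "model" below has the same true accuracy (0.70), so any score differences on a finite evaluation set are pure noise. Picking the best of 50 such models by their score on that set yields an optimistically biased number, while a large fresh set reveals the true performance.

```python
import random


def evaluate(model_acc, n, rng):
    # fraction of n examples the model gets right,
    # each example being correct with probability model_acc
    return sum(rng.random() < model_acc for _ in range(n)) / n


rng = random.Random(0)
true_acc = 0.70
models = [true_acc] * 50  # 50 candidate models, identical true accuracy

# the "dev-free" workflow: pick the best model by its score on a
# 100-example evaluation set (playing the role of the test set)
scores = [evaluate(m, 100, rng) for m in models]
best = max(range(len(models)), key=lambda i: scores[i])

# the score used for selection is inflated by the max-over-noise...
selected_score = scores[best]
# ...while a large, fresh set shows the true accuracy of the winner
fresh_score = evaluate(models[best], 100_000, rng)

print(f"score used for selection: {selected_score:.2f}")
print(f"score on fresh data:      {fresh_score:.2f}")
```

The gap between the two printed numbers is exactly the "knowledge leaked through me" effect: selecting the winner by its score on a set makes that score an overestimate of performance on unseen data.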

Yes, you are right. The whole point of the third dataset is to check how the model would perform in production, but only as a final, informative metric. All decisions and hyperparameter tuning should be made with the dev set.
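That workflow can be sketched in a few lines. This is a minimal illustration with a made-up 1-D classification task and a single threshold standing in for the hyperparameters; none of these names come from the course:

```python
import random

rng = random.Random(42)


def make_split(n):
    # synthetic 1-D data: label is 1 when x > 0.5, with 10% label noise
    xs = [rng.random() for _ in range(n)]
    ys = [int(x > 0.5) if rng.random() > 0.1 else int(x <= 0.5) for x in xs]
    return xs, ys


train, dev, test = make_split(1000), make_split(200), make_split(200)


def accuracy(threshold, split):
    # the "model" predicts 1 when x exceeds the threshold
    xs, ys = split
    return sum(int(x > threshold) == y for x, y in zip(xs, ys)) / len(ys)


# model selection: compare candidate "hyperparameters" on the dev set ONLY
candidates = [i / 20 for i in range(1, 20)]
best_t = max(candidates, key=lambda t: accuracy(t, dev))

# the test set is touched exactly once, for a final unbiased estimate
print(f"chosen threshold: {best_t:.2f}")
print(f"test accuracy:    {accuracy(best_t, test):.2f}")
```

Because `test` played no role in choosing `best_t`, the final printed accuracy is an honest estimate of production performance, unlike a number obtained by comparing all the candidates on the test set itself.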