As far as I have understood from the material so far, the training set is used for estimating the optimal values of the weights and biases, the dev set is used for estimating generalization error for model selection, and the test set is used for estimating generalization error on unseen data after training is complete.
The rationale is that if we select the model based on its test-set error, we are already biased toward whatever model happens to perform best on the test set, rather than on genuinely unseen data.
So, by selecting the model based on the dev-set error instead, the error computed on the test set after training is complete remains an honest estimate of the true generalization error.
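To make sure I have the workflow right, here is a minimal sketch of what I mean, using hypothetical synthetic data and polynomial regression as a stand-in for any family of candidate models. The degrees, split sizes, and data are my own assumptions, not from the course material; the point is only that the dev set drives the selection and the test set is evaluated once at the end.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: a noisy quadratic relationship
x = rng.uniform(-1, 1, 300)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.2, size=300)

# Split: 60% train, 20% dev, 20% test
idx = rng.permutation(300)
train, dev, test = idx[:180], idx[180:240], idx[240:]

def mse(deg, fit_idx, eval_idx):
    # Fit on the training indices, evaluate on the given indices
    coef = np.polyfit(x[fit_idx], y[fit_idx], deg)
    pred = np.polyval(coef, x[eval_idx])
    return np.mean((pred - y[eval_idx]) ** 2)

# Model selection: choose the candidate with the lowest dev-set error
degrees = [1, 2, 5, 9]
best_deg = min(degrees, key=lambda d: mse(d, train, dev))

# The test set is touched exactly once, after selection is complete
test_mse = mse(best_deg, train, test)
print(best_deg, round(test_mse, 4))
```

If I understand correctly, `test_mse` here is the honest generalization estimate precisely because the test indices played no role in choosing `best_deg`.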
Now, say I select the model based on the dev set and finish training it on the training set. Then, after calculating the generalization error on the test set, I find that the model performs poorly.
Presumably I would then rethink my approach and try something else. But in that case, am I not still being influenced by the results on the test set?
And if I am never supposed to change my approach after poor test-set results, why use a test set in the first place?