Is it acceptable to evaluate our models on the test set more than once?

In the “Bird Recognition in the City of Peacetopia (Case Study)” quiz, the answer to one of the questions implies that, after evaluating the model on the test set, if we detect we’ve overfit to the dev set, we can enlarge the dev set, iterate on the model using the new dev set, and then evaluate it on the test set again.

Is my interpretation correct? Wouldn’t this make the second test set error estimate biased? Would this bias be acceptable in a real application?

Hey @Gabr,
Two things come to mind. One is an easy fix that works when training time and compute requirements are small, and the other is the reason the second test-set error estimate can be considered “almost” unbiased.

Starting with the easy solution, for the case where training is fast and computationally cheap: after finding that the model is overfitting the dev set, we can re-adjust the dev set and train the model from scratch, i.e., iterate over [training on the train set & evaluating on the dev set], and only then evaluate on the test set. Since this freshly trained model sees the test set for the first time, there is no risk of a biased estimate.
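To make that workflow concrete, here is a minimal sketch of it. Everything in it (the synthetic data, the `LogisticRegression` model, the split sizes) is an assumption of mine for illustration, not something from the course:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Placeholder data: 10k examples, 20 features, a roughly linearly separable label.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))
y = (X[:, 0] + rng.normal(size=10_000) > 0).astype(int)

# Initial split: 80% train, 10% dev, 10% test.
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.2, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(X_hold, y_hold, test_size=0.5, random_state=0)

# Suppose we detect dev-set overfitting: move some training data into the
# dev set (the test set stays untouched) and train a *fresh* model from scratch.
X_train2, X_extra, y_train2, y_extra = train_test_split(
    X_train, y_train, test_size=0.25, random_state=1)
X_dev2 = np.vstack([X_dev, X_extra])
y_dev2 = np.concatenate([y_dev, y_extra])

model = LogisticRegression().fit(X_train2, y_train2)   # trained from scratch
print("dev accuracy: ", model.score(X_dev2, y_dev2))
print("test accuracy:", model.score(X_test, y_test))   # first look for this model
```

Since the final model is a brand-new one that never interacted with the test set, its test-set score is a clean first look.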

As you might have guessed, the above solution won’t be feasible in most real-world cases. The dev set exists so that we have data on which we can evaluate the model again and again in order to steer it in the right direction. Because the model is evaluated on the dev set repeatedly, it may overfit to it, and hence we need a separate set, the test set, to measure the model’s performance. Since the model sees the test set for the first time, we can consider its performance there a trustworthy estimate.

Now, even if we evaluate the model on the test set a second time, that is still only its second exposure, so we can consider the estimate “almost” unbiased. Note that I have emphasised “almost”: if you don’t get the dev set right in one go, you may need to readjust it multiple times, in which case the model keeps seeing the test set again and again, and the estimate becomes more and more biased.
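To illustrate the “more and more biased” point, here is a toy simulation of my own (the peek count and test-set size are arbitrary choices): every candidate model is pure random guessing with a true accuracy of 0.5, yet if we keep whichever model looks best on the test set, the reported number climbs well above 0.5.

```python
import numpy as np

rng = np.random.default_rng(42)
n_test, n_peeks = 1_000, 50
y_test = rng.integers(0, 2, size=n_test)        # binary labels

best_seen = 0.0
for _ in range(n_peeks):
    # Each "improved" model is really just random guessing (true accuracy 0.5),
    # but we keep it whenever it happens to score well on the test set.
    preds = rng.integers(0, 2, size=n_test)
    best_seen = max(best_seen, float((preds == y_test).mean()))

print("true accuracy of every model: 0.50")
print(f"best test accuracy after {n_peeks} peeks: {best_seen:.3f}")  # typically ~0.53
```

The gap between the reported and the true accuracy grows with the number of peeks, which is exactly the effect described above.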

Now, to the question of whether this bias is acceptable in the real world: depending on how many times you readjust your dev set, and on how large your test set and your model are, in some cases it may be acceptable, and in others it may not.

Here, I haven’t covered the case in which you can simply acquire a new test set. If that is possible, it is a pretty nice way to go as well. I hope this helps.

Regards,
Elemento

As long as the test set was not involved in any of the retraining with the new dev set, I don’t see why that would be a problem. If the newly trained model was never trained on any of the test data, you should still be able to use the test set as an unbiased evaluation of the performance of any trained model.

Thanks for the answers. I used to think like @Elemento, but after giving @paulinpaloalto’s answer some thought, I think I can see why this procedure does not bias the error estimate.

I think the crux of the matter is that adding more data to the dev set cannot introduce bias into the error estimate, as long as the additional data are independent of the test data and we optimize the model’s hyperparameters using only the new dev set. Since the additional data come from the same distribution as the original dev data, independently of the test data, there is no way for them to bias the estimate toward or away from the test data.

The only thing the test data influence is the decision to add more data to the dev set, which I believe is irrelevant in terms of bias, for the reasons mentioned above. A quick simulation below illustrates this.
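As a sanity check of this independence argument, here is a small simulation I put together (again with random-guess models, so the true accuracy of every candidate is known to be exactly 0.5): model selection uses only the dev set, and the test-set estimate of the selected model stays centered on the true value, unlike the repeated test-set peeking discussed earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
n_dev, n_test, n_models, n_trials = 2_000, 1_000, 50, 200

gaps = []
for _ in range(n_trials):
    y_dev = rng.integers(0, 2, size=n_dev)
    y_test = rng.integers(0, 2, size=n_test)
    dev_scores, test_scores = [], []
    for _ in range(n_models):
        # Every candidate model guesses at random, so its true accuracy is 0.5.
        dev_scores.append(float((rng.integers(0, 2, size=n_dev) == y_dev).mean()))
        test_scores.append(float((rng.integers(0, 2, size=n_test) == y_test).mean()))
    best = int(np.argmax(dev_scores))       # selection looks only at dev data
    gaps.append(test_scores[best] - 0.5)    # test estimate minus true accuracy

print(f"mean test-set bias over {n_trials} trials: {np.mean(gaps):+.4f}")  # ~ 0
```

Because the selected model’s test predictions are independent of the dev scores used to pick it, the average gap is essentially zero.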
