Two things come to my mind. One is an easy solution to fix this which could work in the cases of small training time and computation requirements, and another is the reason as to why the second test set error can be considered as “almost” non-biased.
Starting with the easy solution, i.e., training time is small and has small computation requirements. In this case, after finding that the model is over-fitting on the dev set, we can re-adjust the dev set, and train the model from scratch, i.e., iterating over [training on train set & evaluating on dev set], and finally evaluating on test set. Since the model is trained from scratch, hence, it will be seeing the test set for the first time, and doesn’t pose any risk of biased estimate.
The above solution as you might have guessed won’t be valid in most of the real-world cases. The reason behind the existence of dev set is to have some data on which we can evaluate our model again and again, in order to guide our model in the right direction, and since the model sees the dev set again and again, it may overfit on it, and hence, we need a new set now, namely the test set to evaluate the performance of the model. Since the model sees the test set for the first time, hence, we can consider the model’s performance on the test set to be pretty useful.
Now even if we evaluate our model again on the test set, the model is still seeing the test set for the second time, and hence, we can consider this as “almost” non-biased. Here, note that I have emphasised “almost”. This is because you may need to readjust your dev set multiple times if you don’t do it right in the iterations, and in this case, the model may keep on seeing the test set again and again, and with that, the estimate will keep on getting biased and biased.
Now, the question will this bias be acceptable in the real world. Depending on how many times you readjust your dev set, and how large your test set and the model is, in some cases, it may be acceptable, and in others, it may be not.
Here, I haven’t included the case in which you can simply acquire a new test set. If this is possible, then we can consider this as a pretty nice way to go as well. I hope this helps.