Overfit/overtune the dev set?

Hi All,

I have a question about overfitting/overtuning.
In this slide, Prof. Ng said that if an algorithm does well on the training set but not on the dev set, we can try a bigger training set, because it is probably overfitting the training set. → I can understand this intuitively, because the algorithm’s parameters are learned from the training set, and it might be easier to overfit a small training set.

He also said that if an algorithm does well on the dev set but not on the test set, we can try a bigger dev set, because the algorithm is probably overtuned toward the dev set. → This is the part I don’t understand:

Since the algorithm is simply evaluated on the dev set and does not learn any parameters from it, how can it be “overtuned” toward the dev set?

Also, if the new, bigger dev set has the same distribution as the old, smaller one, I suppose the problem of doing well on the dev set but poorly on the test set won’t be solved (because the distribution is unchanged).

Can anybody help clarify these points? Many thanks!

Hi Shangran,

Good question. You are right that the model never ‘trains’ on the dev set, so it may seem that way. But during the training process we repeatedly change different hyperparameters and observe the performance on the train and dev sets. After testing multiple hyperparameter combinations, we may finally find a set of hyperparameters that gives decent results on both the train and dev sets.

However, since we have tested many combinations of hyperparameters against the dev set, it’s possible that we have simply found a set of hyperparameters which, when used with the training data, happens to work well on the dev set. Because we have tested so many times against the dev set, we are biased towards selecting the hyperparameters that work best on it, and thus the model is indirectly influenced by the dev set.
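To make that concrete, here is a rough sketch of the tuning loop I mean. The toy data, the candidate grid for C, and the LogisticRegression model are all just stand-ins I made up for illustration, not anything from the course:

```python
# Rough sketch of dev-set-driven model selection (toy data, made-up grid).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

best_C, best_dev_acc = None, -1.0
for C in [0.001, 0.01, 0.1, 1, 10, 100]:             # candidate hyperparameters
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    dev_acc = model.score(X_dev, y_dev)              # the dev set scores every candidate...
    if dev_acc > best_dev_acc:
        best_C, best_dev_acc = C, dev_acc            # ...and decides which one we keep,
                                                     # so the choice is biased toward it
print(best_C, best_dev_acc)
```

No parameter is ever fit on the dev set here, yet the dev set still picks the winner, which is exactly the indirect influence I described.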

That’s why we split the data into train-dev-test partitions rather than just a train-test partition: after we have done all our hyperparameter tuning, we still have some data against which we can verify whether the model truly performs well on unseen data.
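Continuing the sketch above, the test set is touched exactly once, after the loop has picked best_C, so it still gives an honest estimate of performance on unseen data:

```python
# Continuing the sketch: the test set is used only once, after all tuning.
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
print("dev accuracy: ", best_dev_acc)
print("test accuracy:", final_model.score(X_test, y_test))
```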

Also, a bigger dev set ‘may’ (or may not, as you point out) solve the problem. Doing well on the dev set but poorly on the test set suggests that the test set contains data from a distribution that isn’t captured by the dev set. So by expanding the dev set there’s a chance that we capture the missing distribution, and our tuning will then account for those missing patterns.
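To illustrate that distribution point, here is a second toy sketch where I artificially shift the test data away from the dev data. The noise injection is just a made-up way to simulate the mismatch:

```python
# Toy illustration of a dev/test distribution gap (the shift is simulated).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X, y, test_size=0.2, random_state=0)

# Simulate a test set drawn from a shifted distribution by perturbing features.
rng = np.random.default_rng(0)
X_test, y_test = X_dev + rng.normal(scale=2.0, size=X_dev.shape), y_dev

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("dev accuracy: ", model.score(X_dev, y_dev))    # looks fine
print("test accuracy:", model.score(X_test, y_test))  # drops under the shift
# Adding more dev examples from the *same* distribution would not reveal this
# gap; a bigger dev set only helps if it also contains test-like examples.
```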

Hope this helps.


Thanks @SomeshChatterjee, that’s very helpful!