Do we need training and dev/test data to come from the same distribution?

Hello,

As I understand it, on the one hand it’s OK for the training and dev sets to come from slightly different distributions, but on the other hand this leads to data mismatch, which we then have to address by making the training data more similar to the dev/test data. I am a little confused about which of these is correct and why. I may have misunderstood the relevant part of the lectures, but the two points seem to contradict each other.

Thanks!

Hey @npapadopoulos, welcome to the community. Apologies for the delayed response.

Let’s work through your question with a simple scenario. Suppose we have train, dev and test sets that are appropriately split. Suppose also that there is something we can call production/future data (prod for short). At training/testing time we don’t have this prod data, but once the model is deployed it will run inference on the prod data, so our ultimate goal is to perform as well as we can on the prod data.

Consider 2 cases now:

  • First, the train, dev and test sets all have the same distribution, but it differs from that of the prod data. In this case, the model will perform well on the dev/test sets but poorly on the prod data, so our ultimate goal is not met.
  • Second, we make sure that the dev and test sets reflect the distribution of the prod data, even though they differ slightly from the train set. This corresponds to the first point in your question. In this case, we may observe that the model is not performing well on the dev/test sets because of data mismatch, and so we will try to use different ways to overcome this data mismatch, so that even though the model was trained on a slightly different train set, it can still perform well on the dev/test sets and ultimately on the prod data, thereby meeting our ultimate goal.
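
To make the second case concrete, here is a minimal sketch of how such a split could be built. It assumes you have a large pool of data from some other source plus a small pool that looks like the prod data; the function name `split_for_target` and the toy data are hypothetical and only for illustration.

```python
import random

def split_for_target(source_examples, target_examples, n_dev, n_test, seed=0):
    """Build train/dev/test so that dev and test hold only prod-like data."""
    rng = random.Random(seed)
    target = list(target_examples)
    rng.shuffle(target)
    if n_dev + n_test > len(target):
        raise ValueError("not enough prod-like examples for dev and test")
    # Dev and test are drawn only from the prod-like pool.
    dev = target[:n_dev]
    test = target[n_dev:n_dev + n_test]
    # Leftover prod-like examples join the plentiful source data in train.
    train = list(source_examples) + target[n_dev + n_test:]
    rng.shuffle(train)
    return train, dev, test

# Hypothetical toy data: 1000 examples from another source, 100 prod-like ones.
source = [("other_source", i) for i in range(1000)]
target = [("prod_like", i) for i in range(100)]
train, dev, test = split_for_target(source, target, n_dev=40, n_test=40)
```

With this kind of split, the dev/test scores measure what we actually care about, while the train set is allowed to differ somewhat from them.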

Now, in the second case, where I have mentioned that “we will try to use different ways to overcome this data mismatch”, one option is to make the train data more similar to the dev/test data, but only if doing so doesn’t require moving the dev/test data away from the prod data.
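
Before trying to fix a mismatch, it helps to check that the dev-set gap really comes from mismatch and not from ordinary overfitting. One way to do that, sketched below, assumes you also hold out a slice of data with the same distribution as the train set that the model never trains on (I call it `train_dev` here; that extra set is an assumption of this sketch, not something from the discussion above). The function name and the error values are hypothetical.

```python
def diagnose_gaps(train_err, train_dev_err, dev_err):
    """Split the train-to-dev error gap into a variance part and a mismatch part."""
    gaps = {
        # Same distribution as train, but unseen data: a gap here suggests overfitting.
        "variance_gap": train_dev_err - train_err,
        # Unseen data in both cases, different distribution: a gap here suggests mismatch.
        "data_mismatch_gap": dev_err - train_dev_err,
    }
    gaps["dominant"] = max(
        ("variance_gap", "data_mismatch_gap"), key=lambda name: gaps[name]
    )
    return gaps

# Hypothetical error rates: 1% on train, 1.5% on train_dev, 10% on dev.
report = diagnose_gaps(train_err=0.01, train_dev_err=0.015, dev_err=0.10)
print(report["dominant"])
```

With these made-up numbers, most of the gap sits between `train_dev` and dev, which would point to data mismatch as the thing to work on.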

So, you see, your question overlooks the ultimate goal, which is not to perform well on the dev/test sets but to perform well in production. I have also attached a slide for your reference below.

I hope this helps, but if you still have any queries, we will be more than happy to help you.

Regards,
Elemento

Well said Elemento!

Bringing in inference on production data is another good angle for explaining this :)
