Do we need training and dev/test data to come from the same distribution?

npapadopoulos · December 15, 2021, 9:55am

Hello,

To my understanding, on the one hand it’s OK for the training and dev sets to come from slightly different distributions, but on the other hand this leads to data mismatch, in which case we’ll have to address the issue by making the training data more similar to the dev/test data. I am a little confused about which of these is correct and why. I may have misunderstood the relevant part of the lectures, but the two points seem to contradict each other.

Thanks!

Elemento · April 29, 2022, 4:19pm

Hey @npapadopoulos, welcome to the community. Apologies for the delayed response.

Let’s work through your query with a simple situation. Suppose we have train, dev and test sets, appropriately split. Suppose we also have production/future data (prod for short). At training/testing time we don’t have this prod data, but once the model is deployed it will be performing inference on it, so our ultimate goal is to perform as well as we can on the prod data.

Consider 2 cases now:

First, when the train, dev and test sets all have the same distribution, but it differs from that of the prod set. In this case, the model will perform well on the dev/test sets but fail to perform well on the prod set, and hence our ultimate goal is not met.
Second, when we make sure that the dev and test sets reflect the distribution of the prod set, but differ slightly from the train set. This is the same as the first half of your question. In this case, we may observe that the model does not perform well on the dev/test sets because of data mismatch, so we will try different ways to overcome this mismatch. The aim is that, despite being trained on a slightly different train set, the model still performs well on the dev/test sets and ultimately on the prod set, thereby meeting our ultimate goal.

Now, in the second case, where I mentioned that “we will try different ways to overcome this mismatch”, one of those ways could be to make the train data more similar to the dev/test data, but only if doing so doesn’t pull the dev/test data away from the prod data.
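To make the two cases above concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration and not from the lectures: the 1-D Gaussian data, the shift of 2.0 between train and prod, the midpoint-threshold "model", and all function and variable names. It only shows that a dev set drawn from the train distribution can look much better than prod, while a dev set drawn from the prod distribution tracks it, and that adding prod-like examples to training can narrow the gap.

```python
import random

def sample(n, shift, rng):
    """Draw n labelled 1-D points; class 0 is centred at -2 + shift, class 1 at 2 + shift."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        centre = (2.0 if label else -2.0) + shift
        data.append((rng.gauss(centre, 1.0), label))
    return data

def fit_threshold(train):
    """Toy 'model': the midpoint between the two class means seen in training."""
    means = []
    for label in (0, 1):
        xs = [x for x, y in train if y == label]
        means.append(sum(xs) / len(xs))
    return (means[0] + means[1]) / 2.0

def accuracy(threshold, data):
    """Fraction of points where 'x above threshold' agrees with the label."""
    correct = sum((x > threshold) == bool(y) for x, y in data)
    return correct / len(data)

rng = random.Random(0)
PROD_SHIFT = 2.0  # hypothetical amount by which prod differs from train

train = sample(5000, 0.0, rng)
dev_like_train = sample(2000, 0.0, rng)        # case 1: dev matches train
dev_like_prod = sample(2000, PROD_SHIFT, rng)  # case 2: dev matches prod
prod = sample(2000, PROD_SHIFT, rng)           # unseen at training time

threshold = fit_threshold(train)
acc_dev_case1 = accuracy(threshold, dev_like_train)
acc_dev_case2 = accuracy(threshold, dev_like_prod)
acc_prod = accuracy(threshold, prod)

# One possible way to reduce the mismatch: add prod-like examples to training.
extra_prod_like = sample(5000, PROD_SHIFT, rng)
threshold_mixed = fit_threshold(train + extra_prod_like)
acc_dev_mixed = accuracy(threshold_mixed, dev_like_prod)
acc_prod_mixed = accuracy(threshold_mixed, prod)

print(f"case 1 dev (train-like): {acc_dev_case1:.3f}  prod: {acc_prod:.3f}")
print(f"case 2 dev (prod-like):  {acc_dev_case2:.3f}  prod: {acc_prod:.3f}")
print(f"after mixing in prod-like data, dev: {acc_dev_mixed:.3f}  prod: {acc_prod_mixed:.3f}")
```

In case 1 the dev score is optimistic about prod. In case 2 the dev score is lower, but it is an honest estimate of what will happen in prod, which is why it is the more useful number to optimize against.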

So, you see, what your query is missing is the ultimate goal, which is not to perform well on the dev/test sets but to perform well in production. I have also attached a slide for your reference below.

I hope this helps, but if you still have any queries, we will be more than happy to help you.

Regards,
Elemento

Rashmi · May 5, 2022, 10:59am

Well said Elemento!

Bringing in inference on production data is another angle that explains the query pretty well.
