Data distribution for training-dev set

I would appreciate it if the following point could be explained further. It is about the "Training and Testing on Different Distributions" lecture.
When we split the data, we previously had training, dev, and test sets. Now a training-dev set has been introduced, and the reason for it is to separate variance issues from data-mismatch issues.
I need to know exactly how we create the training-dev set. I mean, do we shuffle the data after mixing the training set with some part of the dev set?
Having watched the lecture three times, my understanding is that we do not merge them; instead, we allocate 200k examples for training and then use 5k for the training-dev set. This is my main difficulty: if the data in the training-dev set is totally new, isn't there still a degree of data mismatch when we check the variance?
Please clarify.

Course 2 week 1 talks about a common data-split strategy based on the amount of available data. Here’s the link.

You can follow the same principles for creating the training-dev set from the training data. It’ll be a two-way split instead of a three-way split. Shuffling the training data with a seed is a good way to ensure reproducible splits.
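A minimal sketch of that two-way split, using the sizes from the question (200k training examples, 5k carved out for training-dev) and a fixed seed for reproducibility. The array `X` here is just a stand-in for the real training examples:

```python
import numpy as np

# Stand-in for the full 200k-example training set (real features omitted).
X = np.arange(200_000)

# Fixed seed so the shuffle, and therefore the split, is reproducible.
rng = np.random.default_rng(seed=42)
perm = rng.permutation(len(X))

train_dev_idx = perm[:5_000]   # 5k examples become the training-dev set
train_idx = perm[5_000:]       # the remaining 195k stay in the train set

X_train_dev = X[train_dev_idx]
X_train = X[train_idx]

print(len(X_train), len(X_train_dev))  # 195000 5000
```

Because both pieces come from the same shuffled pool, the training-dev set is drawn from the training distribution, not from the dev/test distribution.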

There are 2 scenarios to consider based on the outcome:

  1. If we have 1% error on the train set, 1.5% error on the training-dev set, and 10% error on the dev set, we have a data mismatch problem.
  2. If we have 1% error on the train set and 10% error on the training-dev set, we have a high variance problem. Applying techniques like regularization can help address this issue.
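The two scenarios above can be captured in a small helper. This is a hypothetical function, not from the course; the 2% gap threshold is an assumption chosen to match the example numbers:

```python
def diagnose(train_err, train_dev_err, dev_err, threshold=0.02):
    """Label the dominant problem from the gaps between error rates.

    `threshold` (2% by default, an assumed value) is the gap size
    we consider large enough to flag a problem.
    """
    if train_dev_err - train_err > threshold:
        # Error jumps on held-out data from the SAME distribution:
        # the model fails to generalize -> variance.
        return "high variance"
    if dev_err - train_dev_err > threshold:
        # Error jumps only when the distribution changes -> mismatch.
        return "data mismatch"
    return "neither dominates"

print(diagnose(0.01, 0.015, 0.10))  # scenario 1 -> data mismatch
print(diagnose(0.01, 0.10, 0.11))   # scenario 2 -> high variance
```

The key idea: the train → training-dev gap isolates variance, while the training-dev → dev gap isolates data mismatch, because only the second step crosses a distribution boundary.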

Your concern is addressed by the second point.


I think the point you are missing is that the training-dev set is by definition a subset of the training set, so it is not statistically different from it. As Balaji says, you randomly shuffle the training set and then extract a relatively small fraction of it to form the training-dev set. The point of that strategy is that when the dev and test sets come from a different distribution than the training set, this split gives you a way to cope with that problem.
