Data distribution for training-dev set

I would appreciate it if the following point could be explained further. It is about the "Training and Testing on Different Distributions" lecture.
When we split the data, we previously had training, dev, and test sets. Now a training-dev set has been introduced, and the reason for it is to separate variance issues from data-mismatch issues.
I need to know exactly how we create the training-dev set. I mean, do we shuffle the data after mixing the training set with some part of the dev set?
Having watched the lecture three times, my understanding is that we do not merge them; instead, we allocate 200k examples for training and then use 5k for the training-dev set. This is my main difficulty: if the data in the training-dev set is totally new, isn't there still a degree of data mismatch when we check the variance?
Please clarify.

Course 2 week 1 talks about a common data-split strategy based on the amount of available data. Here’s the link.

You can follow the same principles for creating the training-dev set from the training data. It’ll be a two-way split instead of a three-way split. Shuffling the training data with a seed is a good way to ensure reproducible splits.
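A minimal sketch of that two-way split, using the sizes from the question (200k training examples, 5k carved out for training-dev) and a fixed seed for reproducibility. The array `X` here is just a stand-in for the real training examples:

```python
import numpy as np

# Stand-in for the full 200k-example training set (real features omitted).
X = np.arange(200_000)

# Fixed seed so the shuffle, and therefore the split, is reproducible.
rng = np.random.default_rng(seed=42)
perm = rng.permutation(len(X))

train_dev_idx = perm[:5_000]   # 5k examples become the training-dev set
train_idx = perm[5_000:]       # the remaining 195k stay in the train set

X_train_dev = X[train_dev_idx]
X_train = X[train_idx]

print(len(X_train), len(X_train_dev))  # 195000 5000
```

Because both pieces come from the same shuffled pool, the training-dev set is drawn from the training distribution, not from the dev/test distribution.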

There are 2 scenarios to consider based on the outcome:

  1. If we have 1% error on the train set, 1.5% error on the training-dev set, and 10% error on the dev set, we have a data mismatch problem.
  2. If we have 1% error on the train set and 10% error on the training-dev set, we have a high variance problem. Applying techniques like regularization can help address this issue.
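The two scenarios above can be captured in a small helper. This is a hypothetical function, not from the course; the 2% gap threshold is an assumption chosen to match the example numbers:

```python
def diagnose(train_err, train_dev_err, dev_err, threshold=0.02):
    """Label the dominant problem from the gaps between error rates.

    `threshold` (2% by default, an assumed value) is the gap size
    we consider large enough to flag a problem.
    """
    if train_dev_err - train_err > threshold:
        # Error jumps on held-out data from the SAME distribution:
        # the model fails to generalize -> variance.
        return "high variance"
    if dev_err - train_dev_err > threshold:
        # Error jumps only when the distribution changes -> mismatch.
        return "data mismatch"
    return "neither dominates"

print(diagnose(0.01, 0.015, 0.10))  # scenario 1 -> data mismatch
print(diagnose(0.01, 0.10, 0.11))   # scenario 2 -> high variance
```

The key idea: the train → training-dev gap isolates variance, while the training-dev → dev gap isolates data mismatch, because only the second step crosses a distribution boundary.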

Your concern is addressed by the second point.


I think the point you are missing is that the training-dev set is by definition a subset of the training set, so it is not statistically different from it. As Balaji says, you randomly shuffle the training set and then extract a relatively small fraction of it to form the training-dev set. The point of that strategy is that when the dev and test sets come from a different distribution than the training set, this split gives you a way to cope with that problem.
