I am really confused about the train/dev/test set guidelines and the related questions in the quiz.
The course says that the dev and test sets should come from the same distribution. Shouldn't the training set be drawn from that same distribution too? It doesn't make sense to me to train a model on a training set drawn from a different distribution than the dev and test sets used to evaluate it.
Then, when you get new data, I am totally confused about what you can do with it, and I got all the related quiz questions wrong…
Should you add it:
to the training set?
to the dev and test sets?
to the training, dev, and test sets?
Does it depend on how much new data you have?
Let me break down the key points to provide clarity:
1. Training Set:
The training set is used to train your machine learning model, and it’s crucial that it represents the data distribution your model will encounter. Ideally, the training set, dev set, and test set should all come from the same distribution.
2. Dev (Validation) Set:
The dev set, or validation set, is used to tune your model's hyperparameters and make decisions about its architecture. It should come from the same distribution as the test set — the distribution your model will face in production — so that the choices you make on it carry over to real use.
3. Test Set:
The test set evaluates your model's performance after training and tuning. It must come from the same distribution as the dev set, providing an unbiased estimate of how well the model will handle new, unseen data.
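To make the three splits concrete, here is a minimal sketch of building train/dev/test sets from one pool of data. Shuffling once before slicing is what keeps all three splits on the same distribution; the function name and the 80/10/10 fractions are just illustrative, not from the course.

```python
import random

def split_dataset(examples, dev_frac=0.10, test_frac=0.10, seed=0):
    """Shuffle once, then slice into train/dev/test splits.

    Shuffling before slicing ensures all three splits are drawn
    from the same distribution. Fractions are illustrative.
    """
    data = list(examples)
    random.Random(seed).shuffle(data)  # fixed seed for reproducibility
    n = len(data)
    n_test = int(n * test_frac)
    n_dev = int(n * dev_frac)
    test = data[:n_test]
    dev = data[n_test:n_test + n_dev]
    train = data[n_test + n_dev:]
    return train, dev, test

train, dev, test = split_dataset(range(1000))
print(len(train), len(dev), len(test))  # 800 100 100
```

For very large datasets the course notes you can shrink the dev/test fractions well below 10%, since a few thousand examples is often enough to evaluate on.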
To illustrate, consider the Cats vs. Dogs example. Suppose your deployed app will classify photos taken on phones, but most of your labeled data consists of high-quality 4K pictures downloaded from the internet. Your dev and test sets should be built from phone photos, since that is the distribution you actually care about, and if possible you should mix some phone photos into the training set as well so the model adapts to the data it will face in actual use.
Regarding Incorporation of New Data:
Adding New Data to the Training Set:
If you acquire new labeled data, consider adding it to your training set, especially if it aligns with the existing distribution. This enhances your model’s performance by exposing it to a more diverse dataset.
Remember, avoid adding new data directly to the dev and test sets mid-project. These sets define the target your model is being measured against; changing them changes the target, so results from before and after the change are no longer comparable.
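The guideline above can be sketched in a couple of lines: new labeled data goes into the training set, while the dev and test sets stay frozen so evaluation numbers remain comparable across experiments. The function name and signature are hypothetical, purely for illustration.

```python
def incorporate_new_data(train, dev, test, new_examples):
    """Fold newly collected labeled examples into the training set only.

    Dev and test are returned unchanged: they stay frozen so that
    error metrics remain comparable across experiments.
    (Name and signature are illustrative, not from the course.)
    """
    return train + list(new_examples), dev, test

# Two new labeled examples go to training; dev/test are untouched.
train, dev, test = incorporate_new_data([1, 2], [3], [4], [5, 6])
```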
Hope this clarifies the concepts for you!
Regards,
Jamal
In addition to Jamal’s great explanations, note that this course is the most sophisticated at dealing with data issues of any of the DLS courses. Prof Ng goes into a lot of subtlety and detail in the lectures. For example, he does make the point that the dev and test sets always need to be from the same statistical distribution, but it is possible to deal with a case in which the training data set contains data that is statistically different from that in the dev and test sets. This kind of scenario can happen with projects that have large datasets which may evolve over time. Prof Ng discusses this in some detail and mentions that in this type of case you can subdivide the training set to carve out a special “training-dev” subset.
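The "training-dev" idea mentioned above can be sketched as carving a small held-out slice out of the training data. Because that slice shares the training distribution, comparing error on it against error on the dev set lets you separate a variance problem from a data-mismatch problem. The 5% fraction and the function name are assumptions for illustration only.

```python
import random

def carve_training_dev(train, frac=0.05, seed=0):
    """Carve a 'training-dev' subset out of the training data.

    The carved slice shares the training distribution, so:
      - high error on training-dev (vs. training error) -> variance
      - low training-dev error but high dev error -> data mismatch
    (Fraction and name are illustrative placeholders.)
    """
    data = list(train)
    random.Random(seed).shuffle(data)
    k = int(len(data) * frac)
    return data[k:], data[:k]  # (reduced training set, training-dev set)

new_train, train_dev = carve_training_dev(list(range(100)))
print(len(new_train), len(train_dev))  # 95 5
```

Note the model is never trained on the training-dev slice; it is held out purely for this diagnostic comparison.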
There is a lot of detail and complexity here, so if you’re still not feeling comfortable that you have complete understanding, it would be a good idea to listen to the lectures again. E.g. if you missed that point about the “training-dev” subset, there are three lectures in Week 2 under the heading “Mismatched Training and Dev/Test Set” that cover all those points.