Fine Line between Training set, Dev set, and Test set

The intuition is clear: to diagnose high variance or high bias, we train the model on one subset of the examples and then use the remaining subset, which the model has not seen, to check whether the model has high variance or bias.

Training data is the subset of the examples on which we train our model.
Dev set and Test set are the subsets that are unseen by our model.

It’s unclear why we need more than one subset of unseen data to choose a model. We could just use one subset of unseen data, since only one subset influences our decision when choosing the right model.

In the lab, I swapped the data in the Dev Set and the Test Set. After computing the MSEs again, I found that with the sets swapped, a degree-6 polynomial was estimated to be the better model, with an even lower estimated generalization error (MSE). (Second image represents output after data sets were switched)


This raises two questions: how do we decide which data set should be treated as the unseen data, and how many subsets of unseen data should we create? Where is the fine line between the training set, dev set, and test set?

Here are some reasons for having three sets of data:

  • The training set is used for training.
  • The validation set is used to adjust the model (such as for optimizing the regularization, to avoid overfitting).
  • The test set is used as a final check whether the completed model gives “good enough” performance.
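The three roles above can be sketched in a few lines. This is only an illustration, not the course lab: the noisy cubic data, the 36/12/12 split, the helper names `fit_poly`, `mse`, and `select_degree`, and the plain normal-equations solver are all assumptions made for this example. It fits each polynomial degree on the training set, picks the degree with the lowest dev-set MSE, and only then measures the chosen model on the test set. It also repeats the selection with the test set standing in for the dev set, which may or may not pick a different degree.

```python
import random

def fit_poly(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (fine for small degrees)."""
    n = degree + 1
    # Augmented matrix [A^T A | A^T y], where A is the Vandermonde matrix of xs.
    m = [[sum(x ** (i + j) for x in xs) for j in range(n)]
         + [sum(y * x ** i for x, y in zip(xs, ys))] for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (m[i][n] - s) / m[i][i]
    return coeffs  # coeffs[k] multiplies x**k

def mse(coeffs, xs, ys):
    """Mean squared error of the polynomial `coeffs` on the points (xs, ys)."""
    preds = [sum(c * x ** k for k, c in enumerate(coeffs)) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)

def select_degree(train, held_out, degrees):
    """Fit each degree on `train`; return the degree with the lowest MSE on `held_out`."""
    errors = {}
    for d in degrees:
        coeffs = fit_poly(*train, d)
        errors[d] = mse(coeffs, *held_out)
    return min(errors, key=errors.get), errors

# Hypothetical data: a noisy cubic, already in random order.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(60)]
ys = [x ** 3 - x + rng.gauss(0, 0.1) for x in xs]
train = (xs[:36], ys[:36])
dev = (xs[36:48], ys[36:48])
test = (xs[48:], ys[48:])
degrees = range(1, 7)

# Training set: fit. Dev set: choose the degree. Test set: final check only.
best, dev_errors = select_degree(train, dev, degrees)
test_mse = mse(fit_poly(*train, best), *test)
print("degree chosen on dev:", best, "| test MSE:", test_mse)

# Same procedure with the test set used for selection instead of the dev set.
swapped_best, _ = select_degree(train, test, degrees)
print("degree chosen if the test set drives the choice:", swapped_best)
```

Whichever held-out set drives the choice of degree is no longer an unbiased estimate for the chosen model, because the model was picked precisely for doing well on it. That is one reason to keep a second held-out set that plays no part in the selection.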

Hi @Ammar_Jawed a great question!

A model's measured performance is only an estimate. You only see how good or bad your model really is once it is in production. To gain confidence that it will work as intended, we need to replicate a real-world environment in which unseen data arrives.

Think of it like an exam. If your goal is a good grade and you take some practice tests, those only give you an estimate; your real grade is released after you take the real exam. In this analogy, your lectures are the training data. Some previous exams are the validation data, which you use to adjust how you study. You also save one test to check, a day before the exam, whether you have learned the material; that is the test set. The production environment is the actual exam.

Please let me know if this answers your question!

The general technique is to randomly shuffle your data and split it into train, validation, and test sets.
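As a rough sketch of the shuffle-and-split step, using only the Python standard library: the function name `split_dataset`, the 60/20/20 fractions, and the fixed seed are assumptions chosen for illustration, not values taken from the course.

```python
import random

def split_dataset(examples, train_frac=0.6, dev_frac=0.2, seed=0):
    """Shuffle a copy of `examples` and cut it into train, dev, and test lists."""
    if not 0 < train_frac < 1 or not 0 <= dev_frac < 1 or train_frac + dev_frac > 1:
        raise ValueError("fractions must be non-negative and sum to at most 1")
    shuffled = list(examples)              # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = int(len(shuffled) * train_frac)
    n_dev = int(len(shuffled) * dev_frac)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]      # whatever remains is held out as the test set
    return train, dev, test

data = list(range(100))                    # stand-in for real examples
train, dev, test = split_dataset(data)
print(len(train), len(dev), len(test))
```

Shuffling before cutting matters when the data arrives in some order (by date, by class, by source), since otherwise the three sets could come from different distributions.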

Some problems may not require a test set at all, so whether to split into a test set or not depends entirely on the specific problem you are trying to solve.

Thanks for the detailed response. This means that the test set is treated like a final check. The more checks, the better the estimate.

Exactly, the test set is just you saving some data to try to mimic the production environment. It is a great question, and you should definitely spend time on it: this is one of the most crucial things to learn in machine learning, since how much you can trust your models' accuracy relies on these concepts.

For more, there is a great chapter in this book:

Designing Good Validation | The Kaggle Book (oreilly.com)

Thanks for sharing this, I’ve saved it as a resource that I’ll be looking into.

I also had this discussion when taking this course. Basically, the cross validation data (not the test data) is used to indirectly influence the learning of the weights in the model.

Here is how I think about it (using a student-and-exam analogy):

  • Training Data: Example from book / question bank
  • Cross validation: The student checking his/her knowledge based on what they just learnt from the examples. If they reuse the same examples, they might be biased by surface factors (keywords in the question, the ordering of the questions, etc.). The student is not actually learning from these questions; they are just a personal check of whether enough was learned to solve them.
  • Testing Data: Mock test paper conducted by the coaching institutes

That’s a good analogy. And you’ve put it well by saying that it has an “indirect” influence on the learning of weights.

As you can see in my question, I swapped the test set and dev set data, and after swapping I got a different degree as the best fit. So the dev set certainly had an influence on the decision of choosing the right model.

The mock test can still be used as the validation set. It is more about the sequence of LEARNING, THEN demonstrating that you have properly learned. The mock test gives the student insight into what they must change and work on (adjust your understanding of Newton’s 2nd Law, adjust your approach to solving Taylor Series, etc.) to get a better grade when the final test comes around. The adjustments are essentially the optimization of your brain.

I put the mock test under Testing Data because we (developers) still have access to that data, so we can analyze the results and re-architect the model, just as instructors in coaching classes can help us fix our mistakes.

So, in my view, real-world data is the final examination, or really the problems we face in real life.

No, that is what the Test Set is for.

So what is the equivalent of production in my analogy?

Production is how you use the model after you have good enough results on the Test Set.

Production can be how you synthesize information to make decisions after getting your degree and working professionally haha. The data is constantly changing, and you have to generalize based on the fundamental training and optimization you’ve had over the years, i.e., years of experience in a particular field. In my case, that field is aerospace. I have a bachelor’s and a master’s in mechanical and aerospace engineering, but when I am designing jet engines I combine that foundation with years of experience (years of brain optimization) to make decisions. And it is all at inference time. Notice how you adjust your approach as you get older: your experiences teach you new ways of problem solving, so you adjust your brain for next time.

I hope that wasn’t too convoluted haha!
