Creating and randomizing training, dev, and test data sets

Hello, I am working on the Deep Learning Specialization and just finished the first course: Neural Networks and Deep Learning. I am eager to get started on creating my own NN and my question is: What is the best practice on randomizing and splitting my dataset into train, dev, and test sets? Will this be covered at some point? Or if someone can simply answer this question for me, that would be great as well! Cheers!

This will be covered in Week 1 of Course 2. The short answer is that you just take all of your data together, randomly shuffle it and then split it into the three sets. The only question is the sizes of the various sets and that decision depends on the total size of the data that you’ve got. If it is relatively small (< 10k samples), then the usual sizes would be 60% training, 20% dev and 20% test, or 80/10/10. If it’s a relatively large dataset (O(10^6) samples or greater), then you’d probably use 98% training and 1% each for dev and test. For in-between sizes, you can interpolate between those extremes. Please stay tuned and listen to more details on this when Prof Ng discusses it in Course 2.
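
As a rough sketch of the arithmetic above, the snippet below turns split percentages into sample counts. The helper name `split_sizes` and the rounding rule (dev and test are rounded down, the remainder goes to training) are assumptions made for illustration, not something taken from the course.

```python
def split_sizes(m, dev_pct, test_pct):
    """Return (n_train, n_dev, n_test) for m samples.

    Hypothetical helper: dev and test counts are rounded down and
    whatever remains goes to the training set.
    """
    n_dev = m * dev_pct // 100
    n_test = m * test_pct // 100
    n_train = m - n_dev - n_test
    return n_train, n_dev, n_test


# Small dataset, 60/20/20
print(split_sizes(5_000, 20, 20))      # (3000, 1000, 1000)
# Large dataset, 98/1/1
print(split_sizes(1_000_000, 1, 1))    # (980000, 10000, 10000)
```

Even at 1%, the dev and test sets of the large dataset still hold 10,000 samples each, which is why such small percentages can be enough there.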

But the overall point is that you just randomly select the various subsets from the same unified pool of data, so that you end up with statistically similar properties. It’s a mistake to do it in a way that gives you different properties (e.g. use only outdoor photographs for training and indoor photographs for dev and test).

Thanks Paul! And I assume Python has an easy way to do the shuffling/splitting? I’m searching for that now. Any preference here?

Oooh, I think I may have found a good article about this but am curious what you use for this.

There is an assignment in Week 2 of Course 2 where they suggest how to do that. See the Optimization Assignment. The technique used there is to split the training set into mini-batches, but the shuffling technique is completely generic.

You can use np.random.permutation to generate a permuted list of numbers and then use ranges of that list as indices into the “samples” dimension of your arrays.

Here’s a little experiment to show the idea:

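import numpy as np

np.random.seed(2)
perm = list(np.random.permutation(8))  # shuffled sample indices 0..7
print(f"perm = {perm}")
A = np.random.randint(0,10,(2,8))      # 2 features x 8 samples
print(f"A = {A}")
print(A[:,perm[0:4]])                  # columns picked by the first 4 shuffled indices

Output:

perm = [4, 1, 6, 2, 3, 7, 5, 0]
A = [[2 1 5 4 4 5 7 3]
     [6 4 3 7 6 1 3 5]]
[[4 1 7 5]
 [6 4 3 3]]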
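
The same indexing trick extends to a full train/dev/test split. The following is a minimal sketch, assuming the Course 1 layout where X has shape (n_x, m) and Y has shape (1, m), with one sample per column. The function name `shuffle_and_split`, its default fractions, and the use of `np.random.default_rng` instead of `np.random.seed` are illustrative choices, not the assignment's code.

```python
import numpy as np


def shuffle_and_split(X, Y, dev_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle the columns of X and Y together, then cut three sets.

    Hypothetical helper. Assumes X is (n_x, m) and Y is (1, m).
    """
    m = X.shape[1]
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)            # shuffled sample indices

    n_dev = int(m * dev_frac)
    n_test = int(m * test_frac)
    n_train = m - n_dev - n_test

    train_idx = perm[:n_train]
    dev_idx = perm[n_train:n_train + n_dev]
    test_idx = perm[n_train + n_dev:]

    # Index X and Y with the same indices so labels stay aligned.
    return ((X[:, train_idx], Y[:, train_idx]),
            (X[:, dev_idx], Y[:, dev_idx]),
            (X[:, test_idx], Y[:, test_idx]))


# Toy data: 3 features, 10 samples; the label equals the column number.
X = np.arange(30).reshape(3, 10)
Y = np.arange(10).reshape(1, 10)
(train_X, train_Y), (dev_X, dev_Y), (test_X, test_Y) = shuffle_and_split(X, Y)
print(train_X.shape, dev_X.shape, test_X.shape)  # (3, 6) (3, 2) (3, 2)
```

Because the three index ranges are disjoint slices of one permutation, every sample lands in exactly one set.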

Thanks Paul! I’ll give that a try!

One more question if you don’t mind. Do you recommend I wait until the end of the specialization to try creating my own NN? I might be jumping the gun here … I’m just so excited to try the techniques on my own projects. I don’t think it hurts, and it helps me understand the code better if I have to do it with my own project, but trying it out on my own takes away my time for continuing the specialization. I’m wondering if I ought to be a little more patient and take the remaining courses first. I know there’s no right or wrong here, just asking for what you would recommend.

Hi, Seth.

I think it’s a great idea to try applying the ideas from Course 1 to solve a new problem that is of interest to you. As you say, it will definitely help you understand the code base we have and also the concepts and how things work. And it will also give you that much more understanding and appreciation of why the things Prof Ng will be showing us in Course 2 are relevant and useful.

In the longer term, Prof Ng will introduce you to TensorFlow, which is a higher level package that has “canned” routines for doing all the things that Prof Ng has shown us how to build ourselves directly in python and numpy here in Course 1. TF also has lots of additional functionality and is the way people normally build DNNs to solve real problems. But Prof Ng has a strong pedagogical reason for teaching us how to build a DNN directly in python first: If you start by learning TF, then everything is just a “black box” to you. It is almost always the case that things don’t work very well the first time you try putting together a solution for a given problem. If you don’t have the kind of understanding that you get from seeing what’s really happening “under the covers”, then it’s hard to develop the intuitions for what to do when things don’t work the way you want. All that will be a major topic for Course 2.

Doing the kind of experimentation that you’re describing will definitely help give you the skills for that kind of problem solving. Even if it delays you a bit from proceeding with the rest of the courses, I bet you’ll find that it will be worth it. Having a better understanding of the Course 1 material will also give you a better vantage point for being successful in the rest of the courses. Give it a try and you’ll probably know pretty quickly whether you’re finding it useful or not.

Please let us know how it goes and if you come up with any cool solutions or new insights!

Hello,
I am currently taking Course 2: Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization.

My question is related to train/dev/test sets and the final model that you would use.
In the course, they suggest a way to divide the data into 3 sets and develop the model. I assume that the ‘final model’ (the optimized model that you would apply in your job) is trained only with the training set.
However, I have also heard suggestions that it’s best to develop the model with train/dev/test in the same manner as explained in the course, but once you have optimized the parameters, you should retrain a final model with the optimal parameters and 100% of the data.

What is your opinion on this? Does it make sense?

Prof Ng covers these points in quite a bit of detail in Week 1 of Course 2. I’ll give just a high level summary and then you should definitely proceed through Course 2 and hear the full explanation from Prof Ng.

The idea is that the three datasets are for different purposes:

You always use the training data for the training phase, but the “dev” and “test” sets are used for different purposes. You train with the training set and then use the “dev” set to evaluate whether the hyperparameters you have chosen are good or not. That includes everything from the network architecture (number of layers, number of neurons, activation functions …) to the number of iterations, learning rate, regularization parameters and so forth. That means you do training with the training data and then evaluate the accuracy on the dev set in this phase.

Once you have used the training set and dev set to select what you believe are the best choices for the hyperparameters, you then finally evaluate the performance of that “final” model on the test data. The point being that you want the final test to use data that was not involved in any aspect of the training up to that point, so that you get a fair picture of the performance on general input data.
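
The train / dev / test workflow described above can be sketched in a few lines. To keep it short and self-contained, this uses closed-form ridge regression on synthetic data as a stand-in for a neural network, with the regularization strength as the only hyperparameter. The data, the candidate values, and the helper names `fit_ridge` and `mse` are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (an assumption, for illustration only)
m, n = 300, 5
X = rng.normal(size=(m, n))
true_w = rng.normal(size=n)
y = X @ true_w + 0.5 * rng.normal(size=m)

# Shuffle once, then cut 60/20/20
perm = rng.permutation(m)
train, dev, test = perm[:180], perm[180:240], perm[240:]


def fit_ridge(X, y, lam):
    """Closed-form ridge regression: solve (X^T X + lam * I) w = X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)


def mse(X, y, w):
    """Mean squared error of the linear model w on (X, y)."""
    return float(np.mean((X @ w - y) ** 2))


# 1) Fit on the training set, 2) compare the candidates on the dev set
candidates = [0.01, 0.1, 1.0, 10.0, 100.0]
dev_errors = {lam: mse(X[dev], y[dev], fit_ridge(X[train], y[train], lam))
              for lam in candidates}
best_lam = min(dev_errors, key=dev_errors.get)

# 3) Touch the test set once, only for the final chosen model
w_final = fit_ridge(X[train], y[train], best_lam)
test_mse = mse(X[test], y[test], w_final)
print("best lambda:", best_lam)
print("test MSE:", test_mse)
```

The test indices never appear in the fitting or selection steps, which is what makes the final number a fair estimate.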

@paulinpaloalto just some feedback here for what it’s worth. After the course “Neural Networks and Deep Learning” I was so excited to apply what I learned to my own problem set. I went on to the course “Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization” and then the next, “Structuring Machine Learning Projects”, and I was still lost about how to APPLY what I’ve learned to my own problem set in my domain of expertise. I lost interest. I couldn’t understand why I’m learning how to tune hyperparameters and structure my machine learning projects when I don’t have a neural network to tune or structure. Maybe these courses are for those who already have ML projects they are working on.

I think I would have been more engaged if I could create my own neural network with my own data. I know I “filled in the blanks” in the homework, but that’s way different than creating your own from scratch. In fact, I’m still longing to do this and I have no idea where to start applying these concepts to my own data.

Anyhow, just wanted to give that feedback because I love what you at DeepLearning are trying to do for learning AI.

Train/Dev/Test sets may originate from the same distribution, but they are randomly drawn from this distribution, and have finite sizes. If we use ONE random Test set to compute ONE value of a metric for measuring our model quality, then this value may be a poor estimate of the expected value of the metric (depending on its variance, which we do not know either). The metric is a RANDOM VARIABLE, since it depends on a randomly selected Test set. The expected value of this metric may be estimated with its mean across different instances of Test sets. But these instances will have different instances of Train / Dev pairs, which have to be used to rebuild the new model from scratch each time. This process may have to be repeated for the Test set several times (e.g. 30 times, which is a “magic” number from statistics for sufficiently “large” distributions). Moreover, the Dev set is random as well, which means the model metrics on it are random variables too. It means that we cannot fully rely on the random sample values of these metrics while making decisions regarding tuning hyperparameters, unless we generate “enough” of them to estimate their expected value with their means. But at least in this case k-fold cross-validation helps, which is not the case with the Test set. The lecture does not mention these challenges. Is there a reason for that? What is the standard approach? Is there an issue at all? If so, what is the remedy?
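
To get a feel for the size of the effect raised here, the following is a small simulation, not a full retraining experiment. It assumes a fixed model whose true accuracy is 0.90 and draws many independent test sets of 1,000 samples each. All of these numbers are made up for illustration. It captures only the variability from sampling the test set, not the variability from re-splitting and retraining.

```python
import numpy as np

rng = np.random.default_rng(0)

true_acc = 0.90    # assumed "real" accuracy of a fixed model
n_test = 1000      # size of each simulated test set
n_trials = 2000    # number of independent test sets drawn

# Each prediction is correct with probability true_acc, so the number of
# correct predictions on one test set is Binomial(n_test, true_acc).
measured = rng.binomial(n_test, true_acc, size=n_trials) / n_test

empirical_std = measured.std()
# Standard deviation of a binomial proportion: sqrt(p * (1 - p) / n)
theoretical_std = np.sqrt(true_acc * (1 - true_acc) / n_test)

print(f"mean measured accuracy: {measured.mean():.4f}")
print(f"empirical std:   {empirical_std:.4f}")
print(f"theoretical std: {theoretical_std:.4f}")
```

Under these assumptions the spread shrinks like 1/sqrt(n_test), so a larger test set gives a tighter estimate from a single evaluation.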

Sorry, I forgot to get back to this question. I think I just answered the same question on a different thread. Here’s perhaps a simpler response:

I think the issue you hypothesize is not really an issue. What do you mean that the metric of accuracy on the test set is a poor estimate of the performance of the model? It’s the only metric we’ve got. If it’s not a good proxy for the “real” performance of your model, then that just means you did a bad job of selecting the test set and you should start over with a better subdivision of your overall data.

Hi, Seth.

I apologize that I somehow missed your reply on this thread when you made it, which is now quite a while ago. I was reminded by James’s recent reply.

I think you have missed some important points here, but maybe that indicates some problems in how the material is presented. “Having an ML project” doesn’t mean that it comes with a predefined neural network architecture to solve it. Figuring out the type and architecture of the network that will be required is your job as the designer and is part of what Prof Ng is trying to teach here. The way to start is to observe the types of networks that Prof Ng shows us for solving different kinds of problems. Then you have to consider the nature of your new problem. What is the data? What is it that you want to detect in the data? Is it an image recognition problem or a computer vision problem or a language problem? Or something else? Have you seen in any of Prof Ng’s examples in courses 1 - 5 any cases that are at least somewhat similar to the problem you are trying to solve? If so, then you start with the solutions that Prof Ng shows and see if you can adapt them to your problem. How well does it work? If not as well as you need, then you’ve got a starting point and you apply the types of analysis that Prof Ng describes in Course 2 and Course 3 to try to improve the solution. In Course 4, he also introduces the important technique of Transfer Learning in which you can actually take a trained network and then use it as the starting point for a new problem and then add some specialization layers to adapt it to your problem.