Module 1, Setting Up Your Goal: Is one test set sufficient for adequate model performance estimation?

Train/Dev/Test sets may originate from the same distribution, but they are randomly drawn from it and have finite sizes. If we use ONE random Test set to compute ONE value of a metric for measuring model quality (performance), then that value may be a poor estimate of the metric’s expected value (depending on the test set size and the metric’s unknown variance, which together determine the standard error of the mean estimate). The computed metric is a RANDOM VARIABLE, since it depends on the randomly selected Test set instance. The expected value of the metric could be estimated by its mean across different instances of Test sets. But those instances would come with different instances of Train/Dev pairs, which would have to be used to rebuild the model from scratch each time. This process might have to be repeated several (or many) times (e.g. 30 times, which is a “magic” number from statistics for sufficiently “large” samples). Prof Ng suggests reducing the relative size of the test set in deep learning (e.g. from 20% to 1%), but such a reduction would require even more instances of the test set for an adequately accurate estimate of the metric’s expected value from its computed mean.
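
For the accuracy metric in particular, the spread of this single-test-set estimate can be reasoned about directly: the number of correct predictions on a test set of size n is (approximately) Binomial, so the standard error of the measured accuracy is sqrt(p(1-p)/n). Below is a minimal simulation sketch of that point; the “true” accuracy of 0.90 and the test-set sizes are made-up numbers chosen purely for illustration.

```python
import numpy as np

# Hypothetical illustration: how much the accuracy measured on a single random
# test set can wander around the "true" accuracy, purely because the test set
# is a finite random sample.
rng = np.random.default_rng(0)
true_accuracy = 0.90      # assumed "true" accuracy on the full distribution
n_simulations = 10_000    # number of simulated test-set draws

for test_size in (100, 1_000, 10_000):
    # Correct predictions per simulated test set ~ Binomial(test_size, true_accuracy)
    correct = rng.binomial(test_size, true_accuracy, size=n_simulations)
    measured = correct / test_size
    analytic_se = np.sqrt(true_accuracy * (1 - true_accuracy) / test_size)
    print(f"n={test_size:>6}: std of measured accuracy = {measured.std():.4f} "
          f"(analytic SE = {analytic_se:.4f})")
```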

Moreover, the Dev set is random as well, which means the model metrics computed on it are random variables too. This means we cannot fully rely on single sample values of these metrics when making hyperparameter-tuning decisions, unless we generate “enough” of them to estimate their expected values with their means. At least in this case k-fold cross-validation helps, which is not the case with the Test set.
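
As a concrete illustration of that last point, here is a minimal sketch (using scikit-learn and a toy dataset, neither of which comes from the course) of how k-fold cross-validation turns a single noisy dev-set metric into a mean with a standard error that a hyperparameter decision can actually be based on:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Toy data standing in for a real train/dev pool.
X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
for C in (0.01, 1.0):  # two candidate hyperparameter values to compare
    scores = cross_val_score(LogisticRegression(C=C, max_iter=1_000), X, y, cv=cv)
    sem = scores.std(ddof=1) / np.sqrt(len(scores))  # standard error across folds
    print(f"C={C}: accuracy = {scores.mean():.3f} +/- {sem:.3f}")
```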

The lecture does not mention these challenges. Is there a reason for that? What is the standard approach? Is there an issue at all? If so, what is the remedy?

I don’t remember Prof Ng mentioning anything like the issues you have discussed, or trying to solve them by using multiple different test sets. The point is that you have a finite amount of data to use for three distinct purposes: training, tuning hyperparameters (the dev set) and assessing the performance of the final trained model (the test set). In most cases, you start with one large set of curated data, randomly shuffle it, and then select (once) the training set, the dev set and the test set. Of course the sizes of the various subsets must be chosen, and Prof Ng does discuss that choice in some detail. Your suggestion here (and I think I remember you raising a similar idea on a different thread) of repeatedly changing the selection of the subsets does not make sense to me. The point that Prof Ng makes in a number of places is that the whole reason the test set is separate is that you don’t want the results on the test set to be biased in any way by any part of that actual data having been used in the training process. You want to evaluate the model on data that it has never “seen” before, as a proxy for how it will perform with new input data when it is deployed in real use.

If you want to further subdivide the test set and compare the performance on different subsets of it, then I don’t really see what value that adds. If you do the subdivision randomly (fairly), then it’s just a question of the balance between the minibatch size and the amount of noise you see. There might be something to be gained by doing “error analysis”: trying to see if there are patterns in the samples that the model does not predict correctly. Maybe that will lead you to actionable conclusions about what additional types of data you need to add to your whole dataset in order to get better performance. E.g. maybe the model does really well at identifying cats inside a building with furniture and artificial light, but not outside with natural light and surroundings. So you need to start over with a dataset that includes more outside images. Just a contrived example to make the point, of course.

There are some other more sophisticated issues to do with cases in which the training data may not be from the same statistical distribution as the dev/test data. Prof Ng discusses a number of scenarios of that sort in Week 2 of Course 3. If you haven’t finished Week 2 yet, maybe the best idea is to “hold these thoughts” and keep listening.

Thank you for your detailed response. I will listen to the Week 2 Course 3 lectures soon. For now, let me clarify a few points. I do not suggest “to further subdivide the test set and compare the performance on different subsets”. I do not suggest that in my scenario “the training data may not be from the same statistical distribution as the dev/test data”. We can assume that the train/dev/test subsets come from exactly the same distribution.

My point was about the Standard Error of the sample mean of the metric that we are trying to estimate with just one sample (aka just one test set). Suppose that we do the following model-building steps 20 times:

  1. randomly shuffle the full set
  2. split it into train/dev/test “current” subsets
  3. build the best possible model following the same (unchanging) procedure with the current train/dev subsets.
  4. estimate a metric on the current test set.

In this case, we will get a distribution of metrics for a given model-building procedure; we have just generated 20 RANDOM samples from that distribution. Now we can compute a sample mean and a standard error of the mean for this metric. Then we choose the model (out of 20) that produced a sample metric value closest to that mean, and we report the mean metric (derived from the random test sets) in our model release documentation for the clients. If we do not want to repeat this model-building procedure 20 times, then we should not reduce our test set from 20% to 1% of our full dataset. That was my point.
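
Just to pin down what I mean, here is a minimal sketch of that procedure (scikit-learn, a toy classifier and plain accuracy are illustrative choices only, and I have collapsed the train/dev pair into a single “rest” set to keep it short):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy stand-in for the "full set"; in the real scenario this would be the whole
# dataset and the model-building step would be your fixed training/tuning procedure.
X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)

metrics = []
for seed in range(20):                                 # 20 independent reshuffles/splits
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.10, random_state=seed)       # steps 1-2: shuffle and split
    model = LogisticRegression(max_iter=1_000)         # step 3: same fixed procedure each time
    model.fit(X_rest, y_rest)
    metrics.append(model.score(X_test, y_test))        # step 4: metric on the current test set

metrics = np.array(metrics)
sem = metrics.std(ddof=1) / np.sqrt(len(metrics))      # standard error of the sample mean
print(f"mean accuracy = {metrics.mean():.4f} +/- {sem:.4f} (SEM over {len(metrics)} splits)")
```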

The problem with your method is that it is “impure”. The accuracy on the test set is polluted, because on every iteration other than the first, data that is in the test set on the current iteration was part of the training set on previous iterations. You might as well just take the union of all the data and use it for training, dev and test. But that is impure. That’s the whole point. What does overfitting mean if you’ve trained on all the data?

As I said on the other thread, if the predictions on the test set are not an “accurate representation” of the performance of your model, then the question is how you define that accuracy? The definition is the performance on the test set, right? The only possibility is that you did a bad job of selecting the test set and it is not statistically representative of your data. Recall the discussions about how to select the sizes of the validation set and test set relative to the size of your total dataset. That’s why the rule of thumb is that if you have O(10^4) total samples, then you typically use something like 60/20/20 as the balance between train/dev/test. If you have > O(10^6) total samples, then you might use a balance closer to 96/2/2 or 98/1/1. It is very common in real systems to have total datasets much larger than O(10^6).
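
To make that rule of thumb concrete in absolute numbers (which is really what matters for the statistical question), here is a trivial sketch; the split fractions are just the ones quoted above:

```python
def split_counts(total, train_frac, dev_frac, test_frac):
    """Absolute set sizes implied by a train/dev/test percentage split."""
    return (round(total * train_frac), round(total * dev_frac), round(total * test_frac))

print(split_counts(10_000,    0.60, 0.20, 0.20))   # (6000, 2000, 2000)
print(split_counts(1_000_000, 0.98, 0.01, 0.01))   # (980000, 10000, 10000)
```

Even at 1%, the test set of a 10^6-sample dataset is still 10,000 examples, which is larger in absolute terms than the 2,000-example test set you get from a 20% split of a 10^4-sample dataset.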

Each iteration is completely independent from all others. In fact, each iteration can be done by a different person independently from all others. Thus, they are “pure”, as I described them.

I think we are talking at cross purposes here. It doesn’t matter who clicks the mouse: the question is the data, right?

Also note that training typically takes at least tens of thousands of iterations, and sometimes a lot more than that.

If the test set has just one sample (aka example), then we should likely agree that this test set is not an “accurate representation” of the performance of my model. Am I correct? What makes us think that a test set of only 1% of the full set is an “accurate representation” of the performance of my model? Well, I can just take it for granted from the lecture. Otherwise, in real life with a real dataset in industry, I would have to either make the test set big enough or follow the tedious (and impractical) model-building procedure that I described above.

I’ll have to go back and listen again to exactly what Prof Ng says in the section about how to decide on the sizes of the various sets. (It’s been a couple of years since I took Courses 2 and 3.) But I’m pretty sure that everything he says is with the disclaimer that these are just general rules of thumb and there is no fixed recipe that is guaranteed to work in every case. If you have a real dataset and a real problem you are trying to solve, then you will need to do some experimentation to evaluate lots of the various choices you need to make to design your solution, including the sizes of the training/dev/test sets.

Clearly we can agree that having a test set with only 1 element can’t possibly be statistically representative of an entire set with, say, 10^6 samples. It sounds like you have more statistics background than the typical student, so how would you approach the question of how to select a subset of a dataset with 10^6 elements such that the subset is “statistically representative” of the overall set? How would you evaluate something like that?

If I have a full dataset with 1M samples, and I need a random test set from it, then I would try to minimize the test set size while making sure that it is adequate. Clearly, this optimal size is dataset-dependent, and I would not dare to derive a closed-form formula. I would prefer to estimate this size empirically, as I described earlier. Alternatively, I can give some hints that may allow us to avoid this costly empirical estimation.

First of all, I would use a stratified split, which would help a bit. In the case of classification, I would stratify on the target variable, and then make sure that each class has at least 30 (preferably at least a few hundred) samples in the smallest subset, which is the test subset in our case. For example, if we predict loan defaults with only 1% defaulted samples, and the test subset size is 1%, then our test subset will have only 100 defaults. This is not bad, but I would prefer more, to minimize the standard error for this subset of the test subset: SE = sigma / sqrt(n), where n is the number of samples in that subset.
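
A minimal sketch of that stratified split, using scikit-learn’s train_test_split with its stratify argument and made-up data shaped like the loan-default example (1M rows, ~1% positives):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1_000_000
X = rng.normal(size=(n, 5))                 # 5 made-up features
y = (rng.random(n) < 0.01).astype(int)      # ~1% "default" labels

# Stratified 1% test split: class proportions are preserved in the test subset.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.01, stratify=y, random_state=0)

p = y_test.mean()
print(f"test set size = {len(y_test)}, defaults in test set = {int(y_test.sum())}")
# Rough standard error of the default rate measured on a test set of this size:
print(f"SE of test-set default rate ~= {np.sqrt(p * (1 - p) / len(y_test)):.5f}")
```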

In the case of regression, 10,000 samples are normally good enough if we consider just the target variable. The problem arises if we consider features of the model that may require stratification. For example, suppose that we want to predict prices of houses, we have 1M of them, and only 1% (10,000) of them have golden tiles on the floors (the other houses have different tile types). These are not outliers (at 1%), and this feature clearly has a big impact on the house price. We may fail to identify (find) this feature in order to stratify on it (in conjunction with other rare but important features). Even if we find it and stratify on it, we will still get just 100 such “golden” samples in a 1% test set. This is not ideal for properly representing the population in the test set, especially in a multidimensional space. I would increase the test subset above 1%, and I would hope that I did not miss other rare but important underrepresented factors.
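
For completeness, the regression variant sketched the same way: since the target is continuous, the stratification key is the rare binary feature itself (everything here, from the 1% rate to the toy price model, is made up purely to mirror the golden-tiles example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1_000_000
golden = (rng.random(n) < 0.01).astype(int)    # rare but important binary feature
price = 300_000 + 500_000 * golden + rng.normal(0, 50_000, size=n)   # toy target
X = np.column_stack([golden, rng.normal(size=(n, 3))])               # feature matrix

# Stratify the split on the rare feature, not on the (continuous) target.
X_rest, X_test, price_rest, price_test = train_test_split(
    X, price, test_size=0.01, stratify=golden, random_state=0)

print("golden-tile houses in the 1% test set:", int(X_test[:, 0].sum()))   # ~100
```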

In conclusion, I would say that we should not be greedy: if our dataset is big, our clients want an accurate metric of our model’s performance, and we do not want to build 100 independent models on different train/dev/test splits of the same full dataset, then we may need to allocate a decent fraction of the samples to our single test set. The training subset will still get its fair share of the large full set anyway.

Thanks for the detailed response. You have described lots of great ideas. I don’t think Prof Ng mentions the concept of a stratified split anywhere, but I’m pretty sure he does discuss the ideas around balanced datasets somewhere in Course 3, that is, the number of samples you have for each of the possible label types in your data.

You may be working a bit too hard in the stratification case, in that if you have a large dataset and randomly select a non-trivial subset of it, then you would naturally expect the statistical distribution of the label classes in your selected subset to be very close to that of the total dataset. If it’s not, then doesn’t that simply mean that either your random sorting algorithm is not really random or your subset size is too small? But it is a good point that it is worth analyzing the distribution of the label types in your various selected random subsets to make sure they are reasonable. And it may well be that even if some classes are underrepresented in the overall dataset, you may get better behavior by including more of them in the test set, as long as you can achieve that without depleting that class too severely in the training set.
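
In practice, that sanity check might look something like this sketch (the label values and proportions are made up; the point is just to compare per-class fractions in the full set against a randomly drawn subset):

```python
import numpy as np

def label_distribution(y):
    """Fraction of samples per label value."""
    values, counts = np.unique(y, return_counts=True)
    return dict(zip(values.tolist(), (counts / counts.sum()).round(4).tolist()))

# Made-up labels: 1M samples, three classes, one of them rare.
rng = np.random.default_rng(0)
y_full = rng.choice([0, 1, 2], size=1_000_000, p=[0.70, 0.29, 0.01])

# A plain (non-stratified) random 1% subset, checked for its class balance.
subset_idx = rng.choice(len(y_full), size=10_000, replace=False)
print("full set :", label_distribution(y_full))
print("1% subset:", label_distribution(y_full[subset_idx]))
```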

But the overall conclusion seems to be that you are well equipped to handle all these issues when it comes time to tackle a serious real world problem!