Train/Dev/Test sets may originate from the same distribution, but they are randomly drawn from it and have finite sizes. If we use ONE random Test set to compute ONE value of a metric for measuring model quality (performance), that value may be a poor estimate of the metric's expected value (depending on the test set size and the metric's unknown variance, which determine the standard error of the mean estimate). The computed metric is a RANDOM VARIABLE, since it depends on the randomly selected Test set instance. The metric's expected value can be estimated by its mean across different instances of Test sets. But these instances come with different instances of Train/Dev pairs, which must be used to rebuild the model from scratch each time. This process may have to be repeated several (or many) times (e.g. 30 times, a "magic" number from statistics for sufficiently "large" samples). The professor suggested reducing the relative size of the test set in deep learning (e.g. from 20% to 1%). But such a reduction would require even more instances of the test set for an adequately accurate estimate of the metric's expected value from its computed mean.
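To make the concern concrete, here is a minimal simulation sketch (all numbers are my own illustrative assumptions, not from the lecture): a fixed model with some true accuracy is evaluated on repeatedly redrawn random test sets, and the standard error of the mean across the repetitions quantifies how noisy a single test-set estimate is.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a fixed model with true (expected) accuracy p_true,
# evaluated on random test sets of size n_test drawn from one distribution.
p_true = 0.9          # unknown expected value of the metric (assumed here)
n_test = 1_000        # test set size
n_repeats = 30        # the "magic" 30 repetitions mentioned above

# Each repetition draws a fresh test set; the accuracy measured on it is
# a random variable (number of correct predictions is Binomial).
accuracies = rng.binomial(n_test, p_true, size=n_repeats) / n_test

mean_acc = accuracies.mean()
# Standard error of the mean estimate across the 30 test-set instances.
sem = accuracies.std(ddof=1) / np.sqrt(n_repeats)
print(f"mean accuracy ~ {mean_acc:.4f}, standard error ~ {sem:.4f}")
```

Shrinking `n_test` (e.g. from 20% to 1% of the data) inflates the per-test-set variance, so more repetitions are needed for the same standard error — which is the trade-off the paragraph above points at.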

Moreover, the Dev set is random as well, which means the model metrics computed on it are random variables too. So we cannot fully rely on single random sample values of these metrics when making decisions about tuning hyperparameters, unless we generate "enough" of them to estimate their expected values with their means. At least in this case k-fold cross-validation helps, which is not the case with the Test set.
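For completeness, a self-contained sketch of the k-fold idea (the data, the nearest-centroid toy model, and k = 5 are all my own illustrative assumptions): each fold yields one sample of the Dev metric, and averaging the k samples gives a lower-variance estimate than any single random split.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: two Gaussian classes in 2D.
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.repeat([0, 1], 100)

def nearest_centroid_accuracy(X_tr, y_tr, X_va, y_va):
    # "Fit": one centroid per class; "predict": the closer centroid wins.
    c0 = X_tr[y_tr == 0].mean(axis=0)
    c1 = X_tr[y_tr == 1].mean(axis=0)
    pred = (np.linalg.norm(X_va - c1, axis=1)
            < np.linalg.norm(X_va - c0, axis=1)).astype(int)
    return (pred == y_va).mean()

# k-fold cross-validation: rotate which fold plays the Dev set,
# retraining on the remaining folds each time.
k = 5
idx = rng.permutation(len(y))
folds = np.array_split(idx, k)
scores = []
for i in range(k):
    va = folds[i]
    tr = np.concatenate([folds[j] for j in range(k) if j != i])
    scores.append(nearest_centroid_accuracy(X[tr], y[tr], X[va], y[va]))

cv_mean = np.mean(scores)
cv_sem = np.std(scores, ddof=1) / np.sqrt(k)
print(f"{k}-fold mean accuracy ~ {cv_mean:.3f} +/- {cv_sem:.3f}")
```

The Test set admits no analogous trick: reusing it across folds would leak it into model selection, which is exactly the asymmetry raised above.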

The lecture does not mention these challenges. Is there a reason for that? What is the standard approach? Is there an issue at all, and if so, what is the remedy?