Suppose we have developed a deep learning model, and then data of a different subtype is introduced (e.g. 18-wheeler trucks within a dataset of car images). Let’s say that, before the new data is introduced, the model has these metrics:

Human Err = 0.3%

Train Set Err = 0.15%

Dev Set Err = 0.17%

How much new data must be added to the data set before its effect on the error can be detected? Does the new data have to account for at least 0.01% of the set, or even more, to be noticeable to the people evaluating performance? For example, if 18-wheeler images make up 0.004% of your dev set for an algorithm identifying cars, then even if every 18-wheeler is misclassified, the dev error rises by at most about 0.004 percentage points and stays well below the human error of 0.3%. Does this necessitate data augmentation if you can’t get more data?
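To make the arithmetic in the question concrete, here is a minimal sketch. It assumes the overall error is a simple mixture of the error on the original data and the error on the new subtype; the function names are made up for illustration.

```python
def blended_error(base_error, new_fraction, new_error=1.0):
    """Overall error when a fraction of the set is a new subtype.

    base_error:   error rate on the original data (0.0017 means 0.17%)
    new_fraction: share of the evaluation set made up of the new subtype
    new_error:    error rate on the new subtype (1.0 = every example wrong)
    """
    return (1.0 - new_fraction) * base_error + new_fraction * new_error


def fraction_to_reach(target_error, base_error):
    """Share of fully misclassified new data needed to hit target_error."""
    return (target_error - base_error) / (1.0 - base_error)


dev_error = 0.0017      # 0.17% dev set error from the post
human_error = 0.003     # 0.3% human error from the post
new_fraction = 0.00004  # 0.004% of the dev set is 18-wheelers

worst_case = blended_error(dev_error, new_fraction)
needed = fraction_to_reach(human_error, dev_error)

print(f"worst-case dev error: {worst_case:.4%}")
print(f"still below human error: {worst_case < human_error}")
print(f"share of trucks needed to reach human error: {needed:.4%}")
```

Under these assumptions the worst case is roughly 0.174%, so a total failure on trucks is invisible in the headline number, and trucks would need to be about 0.13% of the dev set before the aggregate error even reaches the human level.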

Hi Winston_Elliott,

This would appear to be an issue of trial-and-error, depending on the implementation you choose. Maybe this post can get you started.

Thank you for the Transfer Learning link, but I am more concerned with the size of the new data set RELATIVE to the current training set than with new training data in general. Obviously, accuracy thresholds/optimization must be set by end users, but what is the optimal distribution of data when the shift in the data set is this small? Even a 100% error rate on the new data would not exceed the variance already evident in the model (the 0.02% gap between train and dev error). There are a few ways we might handle this (that I can think of as a naive pleb):

- create more new data using augmentation, etc., until it accounts for more than the 0.02% variance, then distribute it between the training and development sets (but then how do you know it works on the test set?)
- create a new metric that weights losses on the new data so that a 100% error rate on it is larger than the 0.02% variance (w = 5; but then how do you train or test?)
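As a rough illustration of the second option, here is a minimal sketch of a weighted error metric. The dev set below is hypothetical (100,000 examples, 4 of them trucks, i.e. 0.004%), and `weighted_error` is an invented helper, not an existing library function.

```python
def weighted_error(is_wrong, is_new, weight):
    """Error rate in which mistakes on the new subtype count `weight` times.

    Normalised by the plain example count, so weight=1 gives the usual
    error rate and a larger weight inflates only the new-subtype mistakes.
    """
    total = 0.0
    for wrong, new in zip(is_wrong, is_new):
        if wrong:
            total += weight if new else 1.0
    return total / len(is_wrong)


# Hypothetical dev set: 4 trucks (all misclassified) and 99,996 other cars,
# of which 170 are misclassified (about 0.17%).
n_trucks, n_cars, car_mistakes = 4, 99_996, 170
is_new = [True] * n_trucks + [False] * n_cars
is_wrong = [True] * n_trucks + [True] * car_mistakes + [False] * (n_cars - car_mistakes)

plain = weighted_error(is_wrong, is_new, weight=1)
weighted = weighted_error(is_wrong, is_new, weight=5)

print(f"plain error:    {plain:.4%}")
print(f"weighted error: {weighted:.4%}")
print(f"truck share of weighted error: {weighted - car_mistakes / len(is_wrong):.4%}")
```

With w = 5 the truck mistakes contribute 0.02% here, which only matches the variance rather than exceeding it, so a somewhat larger weight would be needed to push the truck errors clearly above it. Note that a weighted metric is no longer comparable to the unweighted human, train, and dev errors.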

I guess the question here is: what is the best path towards tuning, based on what has worked in the past? What are the common pitfalls of these two approaches, and is there another, better one?

Edit:

If the complete data set is shuffled, or sampled with replacement (bootstrapping), and then re-split to form a new development set, the train and dev errors will shift. Is the size of that shift a better yardstick for deciding how much to increase or weight the new data, so that errors on the new subtype become visible and we can tell whether the model is extracting appropriate features from it? It should be a good proxy for the variance in model accuracy and give a baseline variance for the error rate. Does this break some of the rules of ML design?
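One variant of this idea that keeps the sets separate is to bootstrap only the dev set's per-example results, which gives a rough sense of how much the dev error moves from sampling alone. The sketch below is an assumption-laden illustration: the dev set is hypothetical (20,000 examples with 34 mistakes, about 0.17%), and the helper names are invented.

```python
import random


def bootstrap_error_rates(is_wrong, n_resamples=200, seed=0):
    """Resample per-example outcomes with replacement; return error rates."""
    rng = random.Random(seed)
    n = len(is_wrong)
    rates = []
    for _ in range(n_resamples):
        sample = rng.choices(is_wrong, k=n)  # sampling with replacement
        rates.append(sum(sample) / n)
    return rates


def percentile_interval(values, lower=0.025, upper=0.975):
    """Simple percentile interval over the bootstrap results."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return ordered[int(lower * last)], ordered[int(upper * last)]


# Hypothetical dev set: 20,000 examples, 34 of them misclassified (0.17%).
is_wrong = [True] * 34 + [False] * (20_000 - 34)

rates = bootstrap_error_rates(is_wrong)
low, high = percentile_interval(rates)
mean_rate = sum(rates) / len(rates)

print(f"mean bootstrap error: {mean_rate:.4%}")
print(f"95% percentile interval: {low:.4%} to {high:.4%}")
```

If the interval is wider than the shift you are trying to detect, the headline dev error cannot reveal it at that dev set size. Resampling within the dev set only does not mix training and dev data, which is the design choice made here to avoid the separation problem raised above.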

Hi Winston_Elliott,

I am trying to understand the issue here. You have trained a model on car pictures that exclude 18-wheeler trucks, and you have achieved the human, train, and dev set errors you report. Then you introduce pictures of 18-wheeler trucks. When, and why? During training, testing, or prediction? Is the goal to classify them as 18-wheeler trucks rather than as the ‘other’ category presumably used by the model? If you want the model to recognize 18-wheeler trucks, the reference I provided should give you an indication of how to do this. If not, why would you expect the system to misclassify them rather than assign them to ‘other’, which is fine if you don’t need them recognized as 18-wheeler trucks? Or do you intend to retrain the model completely, and you wonder how it will do once the new data is shuffled into the dataset? That would seem to depend on how the model extracts the features relevant to recognizing 18-wheeler trucks. This is not a simple mathematical or statistical question but an empirical one, because the particular features the model extracts result from random initialization and the effects of gradient descent.

Please clarify if this does not answer your question in any way. I am not sure I understand the issue correctly.

So, the issue is this: you start with a dataset for identifying cars that does not contain 18-wheeler trucks, and you train your model. You then deploy your model and find that 18-wheelers make up a significant number of the vehicles that your application must identify as cars. So your model begins to fail, and you decide to retrain it to include 18-wheelers, which you want to identify simply as cars within your application. As a percentage of the total dataset, how much new data is absolutely necessary to ensure your model predicts all car types, including 18-wheelers?

If the amount of new data you add is too small, you may get 100% error on trucks when you retrain or include it in your dev set, but the effect of this error may be so small that it goes unnoticed without some of the conceivable solutions I listed in the earlier reply. Because the question is an empirical one, I’ve found that statistics about statistics (bootstrapping techniques) can help answer questions such as how error is distributed between sets, and can help determine thresholds. But this breaks some of the “laws” about keeping your data sets rigidly separate.

Hi Winston_Elliott,

Gradient descent is a mathematical approximate optimization technique, not a statistical one. So the outcome of the feature extraction process needed to recognize 18-wheelers cannot be captured with pure statistics (or statistics about statistics). Enough pictures of 18-wheelers need to be part of your training dataset, but you cannot calculate what percentage this should be to lower the error by a particular amount. Because gradient descent is a cybernetic approximation process that aims to capture relevant features that may apply to various types of cars (starting from random initializations of the relevant parameters), this is an empirical process of trial and error. Maybe it turns out that you need very few pictures of 18-wheelers because of the effect these few pictures have on the features that are extracted; maybe you need many. There’s no statistical shortcut here. You just have to iterate and perform error analysis.
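One simple form of the error analysis mentioned above is to report the error per vehicle subtype instead of a single aggregate number, so that a total failure on a tiny slice cannot hide. This is a minimal sketch with toy data; the slice tags and the `error_by_slice` helper are hypothetical and assume each dev example carries a subtype annotation.

```python
from collections import defaultdict


def error_by_slice(labels, predictions, slices):
    """Return {slice_name: (error_rate, example_count)} for each slice."""
    counts = defaultdict(lambda: [0, 0])  # slice -> [errors, total]
    for label, prediction, slice_name in zip(labels, predictions, slices):
        counts[slice_name][1] += 1
        if label != prediction:
            counts[slice_name][0] += 1
    return {name: (errors / total, total) for name, (errors, total) in counts.items()}


# Toy dev set: every example is a car, tagged with a hypothetical subtype.
labels = ["car"] * 6
predictions = ["car", "car", "car", "not_car", "not_car", "car"]
slices = ["sedan", "sedan", "suv", "18-wheeler", "18-wheeler", "suv"]

report = error_by_slice(labels, predictions, slices)
for name, (rate, count) in sorted(report.items()):
    print(f"{name:<12} error={rate:.0%} n={count}")
```

Reporting the example count next to each rate matters, because a slice with only a handful of examples gives a very noisy estimate.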