It is mentioned that adding additional Training Set data whose distribution differs from the Dev/Test Sets helps Dev Set performance. However, the distributions of the Dev and Test Sets should be the same.

Could you please give some intuitive (or statistical) explanation of why adding examples from a different distribution to the Training Set helps Dev Set performance? I would imagine that it might lead to a higher Dev Set error, since the Dev Set has not seen the new data with the different distribution. Wouldn't it be more sensible to split the additional data with the different distribution into Training/Dev/Test Sets in the same proportions as applied to the original data?
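For clarity, here is what I mean by splitting the additional data in the same proportions, as a hypothetical sketch (the 80/10/10 fractions and all names are made up for illustration):

```python
import numpy as np

def split_like_original(data, train_frac=0.8, dev_frac=0.1, seed=0):
    """Shuffle and split `data` into train/dev/test using the given fractions.
    The remaining fraction (1 - train_frac - dev_frac) becomes the test split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    n_train = int(train_frac * len(data))
    n_dev = int(dev_frac * len(data))
    train = [data[i] for i in idx[:n_train]]
    dev = [data[i] for i in idx[n_train:n_train + n_dev]]
    test = [data[i] for i in idx[n_train + n_dev:]]
    return train, dev, test

new_data = list(range(1000))  # stand-in for the additional examples
train, dev, test = split_like_original(new_data)
print(len(train), len(dev), len(test))  # 800 100 100
```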

What is your rule of thumb for the amount of additional data with a different distribution to add to the Training Set so that it still makes sense for Dev Set performance? And how much difference in distribution between the additional Training Set data and the original Training Set data is within acceptable limits to expect a performance improvement on the Dev Set?

I also have the same question.
I want to add one more thing. Does it mean that by setting the Dev Set and metric we are keeping the target fixed? When we add data from a different distribution to the Training Set, are we no longer concerned about that fixed target?
What does it mean for data to come from the same distribution? Apart from coming from the same source, is there any statistical measure to compare whether data are from the same distribution?

Note that our goal is not just to build a model that achieves a low error on the test set; we want a model that generalizes well to real-world data (meaning the data the model will see after deployment). We assume that the dev and test sets are close to that real-world data distribution.

That is also correct. We want to evaluate on data that, we think, is close to real-world data.

That said, the more data we train on, the better the results. At least as long as the data is relevant to the task and our model has the capacity to learn from it.

Hi Andrei (@manifest), thanks for replying to some of the questions!

Could you please comment on these three questions, which are still open?

Referring to what @amitp asked: how do you define a distribution for a dataset of pictures? What measure do you normally use for a distribution? Is it some kind of Variational Autoencoder that you train to get distribution measures?

What is your rule of thumb for the amount of additional data with a different distribution to add to the Training Set so that it still makes sense for Dev Set performance?

How much difference in distribution between the additional Training Set data and the original Training Set data is within acceptable limits to expect a performance improvement on the Dev Set? Or, put more simply, how different can the other distribution be so that you still arrive at a performance improvement on the Dev Set? Are you looking at the covariance matrix variation between the old and new datasets?

I'm not sure that I follow the first question. As for the second and third, why don't you use an evaluation metric to compare the performance of your model on different training data?

With regard to the first question: from basic statistics, the measures of a distribution for a time series are the mean, standard deviation, shape of the probability distribution function, etc. What are the measures of distribution for a dataset of pictures?

On questions 2 and 3: sure, we may use an evaluation metric to identify the difference, but that means trial and error and a waste of time and resources. Does it make sense to retrain the model for every new dataset without a preliminary estimate of whether doing so is sensible? Wouldn't it be possible to get some distribution measures of the new dataset that would spare you from useless retraining?

In ML, by distribution we mean a probability distribution. To describe it, we use a probability mass function (for discrete random variables) or a probability density function (for continuous random variables).
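As a rough illustration of comparing two empirical distributions, one heuristic is to bin two samples into histograms and compute the KL divergence between them. This is only a sketch of one possible measure (the bin edges, sample sizes, and distributions below are made up), not a definitive test of "same distribution":

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions given as count vectors."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
bins = np.linspace(-6, 6, 25)
hist = lambda x: np.histogram(x, bins=bins)[0]

a = rng.normal(0, 1, 10_000)   # sample from N(0, 1)
b = rng.normal(0, 1, 10_000)   # another sample from the same N(0, 1)
c = rng.normal(2, 1, 10_000)   # sample from a shifted N(2, 1)

kl_same = kl_divergence(hist(a), hist(b))   # small: same underlying distribution
kl_diff = kl_divergence(hist(a), hist(c))   # much larger: shifted distribution
print(kl_same, kl_diff)
```

For images, one would have to apply something like this to features rather than raw pixels, which is exactly where it stops being simple.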

The problem here is that comparing data distributions would only make sense if we compared the distribution of the new data to the real-world data distribution. But if we knew the real-world data distribution, we wouldn't need a neural network; it would be possible to solve the problem with some traditional optimization algorithm.

The test and dev sets are our best attempt to match the real-world data, based solely on our understanding of the particular task we are trying to solve.

With an evaluation metric we are able to compare the performance of our solutions.

Just a point of view: in trying to understand the "distribution" of a dataset, we sort of cross the border of intuition here and need statistical and probabilistic tools. I would guess that to learn the distribution of a dataset you might use variational autoencoders (VAEs) link[Tutorial #5: variational autoencoders]. But a VAE is itself another network that needs to be trained. Maybe @manifest can suggest how he uses VAEs (or other tools) to assess a distribution prior to large-scale model retraining. I guess this thread is becoming too advanced for the DLS level.

We build neural networks to model the real-world data distribution. If we already knew that distribution, we wouldn't need neural networks. In other words, if we knew everything that might happen, there would be nothing to predict.

Evaluation of generative models is quite a problem in itself.

Practically, we would need to inspect all of the citizens' data, as there is a big chance people would end up uploading inaccurate pictures, whether unintentionally or intentionally (an adversarial attack initiated by some people who like birds). Taking into consideration that we already have 10 million pictures from a trusted source with some known level of accuracy, it wouldn't be feasible, or at least necessary, to go over the 1 million citizens' pictures, would it?

Technically, if the distribution of the citizens' pictures is noticeably different from that of the security camera pictures (say, almost all security camera pictures are horizontal while almost all citizens' pictures point vertically at the sky), my intuition is that adding the citizens' data to the training data would create an unnecessary bias on the dev/test data that regularization won't be able to fix. Am I missing something here?

One last thing: given that we have a fairly large amount of security camera pictures, which will be our source of data in the real world, why can't we consider the security camera picture distribution as the real-world distribution, or very close to it?

Another point I would like to add: the model weights are usually saved based on performance on the validation set. Thus we would want the test set to be as similar as possible to the validation set.

Another reason is the following: suppose the dev set has a different distribution of samples over the classes, say 90 samples of class 0 and 10 of class 1. The model is great at predicting class 0, with 100% accuracy, and predicts class 1 with 50% accuracy. Then the average accuracy on the dev set would be 95%. However, the test set might have 10 samples of class 0 and 90 of class 1; now the accuracy suddenly drops from 95% to 55%. Therefore, we must try to have the same distribution in the dev and test sets.
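The arithmetic above can be checked with a few lines of Python (the per-class accuracies and class counts are the ones from the example):

```python
# Per-class accuracies from the example above
acc_class0, acc_class1 = 1.00, 0.50

def overall_accuracy(n_class0, n_class1):
    """Accuracy averaged over a set with the given class counts."""
    total = n_class0 + n_class1
    return (n_class0 * acc_class0 + n_class1 * acc_class1) / total

dev_acc = overall_accuracy(90, 10)   # 0.95
test_acc = overall_accuracy(10, 90)  # 0.55
print(dev_acc, test_acc)
```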
The problem of a test sample being a completely random image, unrelated to the data the model was trained on, is called out-of-distribution (OOD) detection. OOD detection is an active research field right now!

What if we use a fine-tuning (transfer learning) approach: use the larger dataset (e.g. high-definition cat images, which are larger in number) to train a base model, and then fine-tune on the blurred images, which are probably fewer in number. Would this be at all helpful?
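Something along these lines, as a toy sketch of the idea with a linear model standing in for the network (all datasets, weights, and learning rates below are invented for illustration): train on the large "base" data first, then continue training on the small "target" data with a smaller learning rate.

```python
import numpy as np

rng = np.random.default_rng(0)

def gradient_descent(w, X, y, lr, steps):
    """Plain full-batch gradient descent on mean squared error for a linear model."""
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

# Large "base" dataset and a small "target" dataset with a slightly different relation
true_big, true_small = np.array([1.0, 2.0, 3.0]), np.array([1.2, 2.1, 2.8])
X_big = rng.normal(size=(5000, 3))
y_big = X_big @ true_big + rng.normal(0, 0.1, 5000)
X_small = rng.normal(size=(100, 3))
y_small = X_small @ true_small + rng.normal(0, 0.1, 100)

w_pre = gradient_descent(np.zeros(3), X_big, y_big, lr=0.1, steps=200)  # "pretraining"
w_ft = gradient_descent(w_pre, X_small, y_small, lr=0.01, steps=100)    # "fine-tuning"
print(mse(w_pre, X_small, y_small), mse(w_ft, X_small, y_small))
```

The fine-tuned weights fit the small target data better than the pretrained weights alone, while starting from what was learned on the large dataset.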

I think that when the sizes of the different datasets (high-definition and blurred) are similar, this might not help very much.

So I understand it this way: if we have a lot of images of birds from a single source, e.g. a bird photographer, it is a good idea to enhance our data and include images taken by, let's say, tourists, environmentalists, or institutes, or even online images.

Just to add to it: the additions to the training data from different sources can be small in proportion, so that the overall distribution does not vary drastically. I hope this is correct.