Adding Training data whose distribution differs from Dev/Test sets

andreygizatullin · August 5, 2021, 5:39pm

Reference: DLS Course 3, Week 1, Quiz question 5

The quiz states that adding Training Set data whose distribution differs from that of the Dev/Test Sets helps Dev Set performance. However, the Dev and Test Sets themselves should come from the same distribution.

Could you please give an intuitive (or statistical) explanation of why adding examples from a different distribution to the Train Set helps Dev Set performance? I would expect it to lead to a higher Dev Set error, since the Dev Set contains none of the new data from the different distribution. Would it not be more sensible to split the additional data across the Training / Dev / Test Sets in the same proportions as were applied to the original data?

What is your rule of thumb for how much additional data from a different distribution can be added to the Training Set and still benefit Dev Set performance? And how far can the distribution of the additional training data differ from that of the original Training Set before an improvement on the Dev Set can no longer be expected?

Appreciate your time addressing these questions.

amitp · August 6, 2021, 6:31am

I have the same question.
I want to add one more thing. Does this mean that once we set the Dev Set and the metric, we are keeping the target fixed? When we add data from a different distribution to the Training Set, are we not concerned about the target that has been fixed?
What does it mean for data to come from the same distribution? Apart from coming from the same source, is there any statistical measure to check whether two datasets come from the same distribution?

manifest · August 7, 2021, 9:04am

Hey guys,

Your intuition is correct.

Note that our goal is not just to build a model that achieves a low error on the test set; we want a model that generalizes well to real-world data (meaning the data the model will be used on after deployment). We assume that the dev and test sets are close to that real-world data distribution.

That is also correct. We want to evaluate on data that we think is close to real-world data.

That said, the more data we train on, the better the results, at least as long as the data is relevant to the task and our model has the capacity to learn from it.

@andreygizatullin @amitp

andreygizatullin · August 7, 2021, 5:18pm

Hi Andrei (@manifest), thanks for replying to some of the questions!

Could you please comment on these 3 questions, which are still open?

1. Referring to what @amitp asked: how do you define a distribution for a dataset of pictures? What measure do you normally use for a distribution? Do you train some type of Variational Autoencoder to get distribution measures?
2. What is your rule of thumb for how much additional data from a different distribution can be added to the Training Set and still benefit Dev Set performance?
3. How far can the distribution of the additional training data differ from that of the original Training Set before an improvement on the Dev Set can no longer be expected? Or, put more simply, how different can the other distribution be while still giving a performance improvement on the Dev Set? Are you looking at the variation in the covariance matrix between the old and new datasets?

Thanks,
Andrey

manifest · August 9, 2021, 2:20pm

I’m not sure that I follow the first question. As for the second and third, why don’t you use an evaluation metric to compare performance of your model on different training data?
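As a rough illustration of that suggestion, the sketch below trains one model per candidate training set and scores each on the same fixed dev set. Everything in it is an assumption made for illustration: the nearest-centroid classifier is only a tiny stand-in for a real network, and the names `fit_nearest_centroid` and `compare_training_sets` and the toy arrays are hypothetical.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


def fit_nearest_centroid(X, y):
    """Stand-in model (assumption): one centroid per class, predict the closest."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])

    def predict(X_new):
        X_new = np.asarray(X_new, dtype=float)
        # Squared distance from every example to every class centroid.
        d = ((X_new[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        return classes[d.argmin(axis=1)]

    return predict


def compare_training_sets(candidates, X_dev, y_dev):
    """Train on each candidate (name -> (X_train, y_train)); score on one dev set."""
    scores = {}
    for name, (X_train, y_train) in candidates.items():
        predict = fit_nearest_centroid(X_train, y_train)
        scores[name] = accuracy(y_dev, predict(X_dev))
    return scores


# Toy, made-up data: the dev set is held fixed, only the training data changes.
X_dev = np.array([[0.0], [1.0], [2.0], [3.0]])
y_dev = np.array([0, 0, 1, 1])
candidates = {
    "original": (np.array([[0.0], [3.0]]), np.array([0, 1])),
    "shifted": (np.array([[0.0], [10.0]]), np.array([0, 1])),
}
print(compare_training_sets(candidates, X_dev, y_dev))
```

The point of the design is that the dev set and the metric never change between runs, so any difference in the scores can be attributed to the training data.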

andreygizatullin · August 9, 2021, 6:41pm

@manifest

With regard to the first question: from basic statistics, the measures of a distribution for a time series are the mean, the standard deviation, the shape of the probability distribution function, etc. What are the corresponding measures for a dataset of pictures?

As for questions 2 and 3: sure, we may use an evaluation metric to identify the difference. But that means trial and error, and a waste of time and resources. Does it make sense to retrain the model for every new dataset without some preliminary estimate of whether doing so is sensible? Would it not be possible to compute some distribution measures for the new dataset, to spare you from useless retraining?

manifest · August 11, 2021, 8:46am

In ML, by distribution we mean a probability distribution. To describe it, we use a probability mass function (for discrete random variables) or a probability density function (for continuous random variables).
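For a single scalar feature, one standard way to compare two samples empirically is the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical CDFs. The sketch below is only an illustration under strong assumptions: it handles one number per example (for images that might be a hypothetical summary such as mean brightness), so it says nothing about the full high-dimensional distribution of pictures.

```python
import numpy as np


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max absolute gap between the empirical CDFs."""
    a = np.sort(np.asarray(sample_a, dtype=float))
    b = np.sort(np.asarray(sample_b, dtype=float))
    # The largest gap is always attained at one of the observed values.
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))


# Made-up scalar features: 0.0 means identical empirical CDFs, 1.0 means no overlap.
print(ks_statistic([1, 2, 3], [1, 2, 3]))
print(ks_statistic([0, 1], [2, 3]))
```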

The problem here is that this comparison of data distributions would only make sense if we compared the distribution of the new data to the real-world data distribution. But if we knew the real-world data distribution, we wouldn’t need a neural network. It would be possible to solve the problem using some traditional optimization algorithm.

Test and dev sets are our best attempt to match the real-world data, based solely on our understanding of the particular task we are trying to solve.

With an evaluation metric we are able to compare the performance of our solutions.

andreygizatullin · August 11, 2021, 8:48am

Understood, thank you for your answer!

saiman · August 11, 2021, 1:34pm

Thanks. Would you mind explaining it in more detail?

andreygizatullin · August 11, 2021, 2:07pm

Just my point of view: in trying to understand the “distribution” of a dataset, we cross the border of intuition and need statistical and probabilistic tools. I would guess that to learn the distribution of a dataset you might use variational autoencoders (VAEs); see “Tutorial #5: variational autoencoders”. But a VAE is itself another ANN that needs to be trained. Maybe @manifest can suggest how he uses VAEs (or other tools) to assess a distribution prior to large-scale model retraining. I guess this thread is becoming too advanced for the DLS level.

manifest · August 11, 2021, 2:44pm

We build neural networks to model the real-world data distribution. If we already knew that distribution, we wouldn’t need neural networks. In other words, if we knew everything that may happen, there would be nothing to predict.

Evaluation of generative models is quite a problem in itself.

hyder · August 12, 2021, 9:57pm

Practically, we would need to inspect all of the citizens’ data, since there is a good chance that people would upload inaccurate pictures, whether unintentionally or intentionally (an adversarial attack initiated by some people who like birds). Considering that we already have 10 million pictures from a trusted source with some level of accuracy, it wouldn’t be feasible, or at least not necessary, to go over the 1 million citizens’ pictures, would it?

Technically, suppose the distribution of the citizens’ pictures is noticeably different from that of the security camera pictures, say almost all security camera pictures are horizontal while almost all citizens’ pictures are pointed vertically at the sky. My intuition is that adding the citizens’ data to the training data would create unnecessary bias relative to the dev/test data that regularization won’t be able to fix. Am I missing something here?

Last thing: given a fairly large number of security camera pictures, which will be our source of data in the real world, why can’t we consider the security camera pictures’ distribution to be the real-world distribution, or very close to it?

Your input is highly appreciated!

thearkamitra · August 14, 2021, 6:55pm

Hey,

Another point I would like to add: the model weights are usually saved based on the validation or test set. Thus we would want the test set to be as similar as possible to the validation set.

Another reason is the following: suppose the dev set and the test set have different class proportions. Let us say the dev set has 90 samples of class 0 and 10 of class 1. The model is great at predicting class 0, with 100 percent accuracy, and it predicts class 1 with 50 percent accuracy. Then the overall accuracy on the dev set would be 95 percent. However, the test set might have 10 samples from class 0 and 90 from class 1. Now the accuracy suddenly drops from 95 percent to 55 percent. Therefore, we must try to have the same distribution in the dev and test sets.
The problem of a test sample being a completely random image, unrelated to the data the model has been trained on, is called out-of-distribution (OOD) detection. OOD is an active research field right now!
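The class-proportion arithmetic above can be checked in a few lines. This is only a worked version of the numbers in the example (90/10 versus 10/90 samples, with per-class accuracies of 100 and 50 percent); the helper name `overall_accuracy` is made up for the illustration.

```python
def overall_accuracy(class_counts, per_class_accuracy):
    """Accuracy over a set, given how many samples each class has
    and how accurate the model is on each class."""
    total = sum(class_counts.values())
    correct = sum(class_counts[c] * per_class_accuracy[c] for c in class_counts)
    return correct / total


# Same model (same per-class accuracy), evaluated on two class mixes.
per_class = {0: 1.0, 1: 0.5}
dev_acc = overall_accuracy({0: 90, 1: 10}, per_class)   # (90 + 5) / 100
test_acc = overall_accuracy({0: 10, 1: 90}, per_class)  # (10 + 45) / 100
print(dev_acc, test_acc)
```

The model has not changed between the two evaluations; only the class mix has, which is why a dev-set score stops being a useful guide once the test set is distributed differently.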

quicksilver · September 15, 2021, 6:59am

What if we use a fine-tuning (transfer learning) approach: train a base model on the larger dataset (e.g. high-def cat images, which are more numerous) and then fine-tune it on the blurred images, which are probably fewer in number? Would this help at all?

I think when the sizes of different datasets (high-def and blurred) are similar, then this might not help very much.

Any thoughts?

Aroonima · October 7, 2021, 7:45am

So I understand it this way. If we have a lot of images of birds from a single source, e.g. a bird photographer, it is a good idea to enrich our data by including images taken by, let’s say, tourists, environmentalists, or institutes, or even online images.

Aroonima · October 7, 2021, 8:31am

Just to add to it… The addition to the training data from some different sources can be small in proportion so that the distribution does not vary drastically. I hope this is correct.

patrick.zimmerman · December 9, 2024, 6:12pm

I’d like to resurrect this thread to get some current commentary from the course assistants. There’s a lot of discussion in this course about adding a relatively large set of observations that come from a different distribution than your intended use case. Andrew’s guidance is basically to keep these data in your training set, and put your relatively smaller set that corresponds to your use case either totally in dev & test, or to distribute some into train. Two basic questions:

1. Have attitudes changed on this in the last few years? Would a foundation model approach be more attractive now to pre-train on the larger set and then fine-tune on the smaller set? Or would that only be if the larger set were unlabeled and so you would just be able to do unsupervised pre-training with it? Whereas, in the setting here with a large labeled set, it’s still better to add it right in to the same training set and just fit one model?
2. If you choose to put all of the larger data set in train, and distribute the smaller but more relevant data set across train, dev, & test, should you use a weighted cost function to give larger weights to cases from the smaller data set and smaller weights to the cases from the larger (less relevant) data?
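To make the second question concrete, here is a minimal sketch of what such a weighted cost function could look like for binary classification. It is an illustration only, not guidance from the course: the function name and the example weights (3.0 for the smaller, more relevant set versus 1.0 for the larger set) are assumptions, and whether any weighting helps would still have to be checked on the dev set.

```python
import numpy as np


def weighted_binary_cross_entropy(y_true, y_prob, weights, eps=1e-12):
    """Binary cross-entropy where each example carries its own weight.

    The loss is normalized by the sum of the weights, so equal weights
    reduce to the ordinary mean cross-entropy.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    weights = np.asarray(weights, dtype=float)
    per_example = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return float(np.sum(weights * per_example) / np.sum(weights))


# Made-up batch: the first example is from the large set, the second from the
# small, more relevant set, which is given a hypothetical weight of 3.0.
y_true = np.array([1.0, 1.0])
y_prob = np.array([0.5, 0.25])
print(weighted_binary_cross_entropy(y_true, y_prob, weights=[1.0, 3.0]))
```

With these weights, a mistake on an example from the smaller set costs three times as much as the same mistake on an example from the larger set.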
