Can adding data hurt?

Hi Sir,

We have a couple of doubts about the lecture below:

Video link: https://www.coursera.org/learn/introduction-to-machine-learning-in-production/lecture/nyEVB/can-adding-data-hurt

  1. In the digit 1 versus alphabet I problem, the mapping from x to y is not clear, so humans cannot do very well on it. For such cases where the mapping is unclear, if we build big (low-bias) models, will the algorithm's accuracy get hurt?

  2. We don't understand this statement at 5:39 in the video:
    But I hope that understanding this rare case where it could hypothetically hurt gives you more comfort with using data augmentation or collecting more data to improve the performance of your algorithm, even if it causes your training set distribution to become different from your dev set and test set distribution.

  1. Large models can learn from accurately labelled data that humans can properly classify. Even a large model will suffer when the x->y mapping is unclear. This goes to say that as long as the data quality is excellent, large models can learn very well even from fewer examples in certain parts of the data distribution.
  2. When data augmentation is performed, the training set distribution becomes larger. It's possible that this additional data isn't present in the dev/test sets to start with (which makes the training set distribution different from the dev/test sets). While this is generally a good thing, including data points that don't have a clear x->y mapping as part of data augmentation could still hurt model performance.
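To make point 2 concrete, here is a minimal sketch (my own toy example, not from the lecture) of the usual practice: augmentation is applied to the training set only, so the training distribution grows while the dev set stays fixed. The `augment` helper and the noise-based augmentation are hypothetical illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(images, labels, noise_std=0.1):
    """Hypothetical augmentation: append a Gaussian-noise copy of each
    training image. Applied to the TRAINING set only."""
    noisy = np.clip(images + rng.normal(0.0, noise_std, size=images.shape),
                    0.0, 1.0)
    return (np.concatenate([images, noisy]),
            np.concatenate([labels, labels]))

# Toy "images": pixel values in [0, 1]
x_train = rng.random((4, 8, 8))
y_train = np.array([0, 1, 0, 1])
x_dev = rng.random((2, 8, 8))   # dev set is deliberately left untouched

x_aug, y_aug = augment(x_train, y_train)
print(x_aug.shape)  # (8, 8, 8) -- training set doubled
print(x_dev.shape)  # (2, 8, 8) -- dev distribution unchanged
```

The training distribution now differs from dev/test by construction; as the lecture argues, that is usually fine, as long as the augmented examples still have a clear x->y mapping.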

@Anbu, @balaji.ambresh makes good points.

Something about point 2: well, you don't want to spread your cards too thin. And here's a point that should really be a science in itself, but that I find is too little discussed: the role of 'data selection' cannot be overlooked.

Granted, maybe that is all the data you have, or can get, and then you’re just ‘stuck’.

But let's say you are trying to distinguish people from animals. You might just think: well, I have plenty of pictures of people and animals, so I'll just add more.

Yet a picture of a person has a ton of other features one might not immediately consider (e.g. their gender, their clothes, their background setting, their expression, and even their race).

It makes me think of that famous mistake Google made, where image search was misclassifying Black people as gorillas.

The distribution you start with is super important, and so is what you augment with.

Personally, I think data selection before training deserves more thought and could be considered a science in itself.
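One cheap way to act on this: before training, audit how secondary attributes are distributed within each class. A minimal sketch (the attribute names and sample metadata here are made up for illustration):

```python
from collections import Counter

# Hypothetical metadata for a people-vs-animals dataset; in practice these
# attributes would come from your labeling pipeline.
samples = [
    {"label": "person", "setting": "indoor"},
    {"label": "person", "setting": "indoor"},
    {"label": "person", "setting": "indoor"},
    {"label": "person", "setting": "outdoor"},
    {"label": "animal", "setting": "outdoor"},
    {"label": "animal", "setting": "outdoor"},
]

# Audit: how are secondary attributes distributed within each class?
by_class = Counter((s["label"], s["setting"]) for s in samples)
for (label, setting), n in sorted(by_class.items()):
    print(f"{label:6s} {setting:8s} {n}")
# A skew like 3 indoor vs 1 outdoor 'person' hints the model may learn
# 'indoor' as a proxy for 'person' -- a data-selection problem, not a
# model problem.
```

If the audit shows a skew, you can target your data collection or augmentation at the under-represented slices instead of just "adding more".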