Can adding data hurt?

Hi Sir,

We have a couple of doubts about the lecture below:

Video link: https://www.coursera.org/learn/introduction-to-machine-learning-in-production/lecture/nyEVB/can-adding-data-hurt

  1. In the digit 1 versus alphabet I problem, the mapping from x to y is not clear, so humans cannot do very well on it. For such cases where the mapping is unclear, if we build big (low-bias) models, will the algorithm's accuracy get hurt?

  2. We don't understand this statement at 5:39 in the video:
    But I hope that understanding this rare case where it could hypothetically hurt gives you more comfort with using data augmentation or collecting more data to improve the performance of your algorithm, even if it causes your training set distribution to become different from your dev set and test set distribution.

  1. Large models can learn from accurately labelled data that humans can properly classify. Even a large model will suffer when the x->y mapping is unclear. This goes to say that as long as the data quality is excellent, large models can learn very well even from fewer examples in certain parts of the data distribution.
  2. When data augmentation is performed, the training set distribution becomes larger. It's possible that this additional data isn't present in the dev/test sets to start with (which makes the training set distribution different from the dev/test sets). While this is generally a good thing, including data points that don't have a clear x->y mapping as part of data augmentation could still hurt model performance.
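To make point 2 concrete, here is a minimal sketch (my own toy example, not from the lecture) of the usual practice: augmentation is applied to the training set only, so the training distribution grows while the dev set stays fixed. The `augment` helper and the noise-based augmentation are hypothetical illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(images, labels, noise_std=0.1):
    """Hypothetical augmentation: append a Gaussian-noise copy of each
    training image. Applied to the TRAINING set only."""
    noisy = np.clip(images + rng.normal(0.0, noise_std, size=images.shape),
                    0.0, 1.0)
    return (np.concatenate([images, noisy]),
            np.concatenate([labels, labels]))

# Toy "images": pixel values in [0, 1]
x_train = rng.random((4, 8, 8))
y_train = np.array([0, 1, 0, 1])
x_dev = rng.random((2, 8, 8))   # dev set is deliberately left untouched

x_aug, y_aug = augment(x_train, y_train)
print(x_aug.shape)  # (8, 8, 8) -- training set doubled
print(x_dev.shape)  # (2, 8, 8) -- dev distribution unchanged
```

The training distribution now differs from dev/test by construction; as the lecture argues, that is usually fine, as long as the augmented examples still have a clear x->y mapping.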

@Anbu, @balaji.ambresh makes good points.

Something about point 2: well, you don't want to spread your cards too thin. And here's a point that should really be a science in itself, but that I find is too little discussed: the role of 'data selection' cannot be overlooked.

Granted, maybe that is all the data you have, or can get, and then you’re just ‘stuck’.

But let's say you are trying to distinguish people from animals. You might just think: well, I have plenty of pictures of people and animals, so I'll just add more.

Yet a picture of a person has a ton of other features one might not immediately consider (e.g. their gender, their clothes, their background setting, their expression, and even their race).

It makes me think of that famous mistake Google made, where image search was misclassifying Black people as gorillas.

The distribution you start with is super important, and so is what you augment with.

Personally, I think data selection before training deserves more thought and could be considered a science in itself.
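One cheap way to act on this: before training, audit how secondary attributes are distributed within each class. A minimal sketch (the attribute names and sample metadata here are made up for illustration):

```python
from collections import Counter

# Hypothetical metadata for a people-vs-animals dataset; in practice these
# attributes would come from your labeling pipeline.
samples = [
    {"label": "person", "setting": "indoor"},
    {"label": "person", "setting": "indoor"},
    {"label": "person", "setting": "indoor"},
    {"label": "person", "setting": "outdoor"},
    {"label": "animal", "setting": "outdoor"},
    {"label": "animal", "setting": "outdoor"},
]

# Audit: how are secondary attributes distributed within each class?
by_class = Counter((s["label"], s["setting"]) for s in samples)
for (label, setting), n in sorted(by_class.items()):
    print(f"{label:6s} {setting:8s} {n}")
# A skew like 3 indoor vs 1 outdoor 'person' hints the model may learn
# 'indoor' as a proxy for 'person' -- a data-selection problem, not a
# model problem.
```

If the audit shows a skew, you can target your data collection or augmentation at the under-represented slices instead of just "adding more".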