Training and Testing on Different Distributions

Hello!

In the video, only two options are considered for dealing with photos from the application. However, I've noticed a third one: why can't we apply dataset augmentation? What are the drawbacks of this approach for that problem?

Hi Vladimir
I think data augmentation is definitely a possibility there. However, Prof. Ng probably does not talk about it in that particular video because the video was about "training and testing on different distributions". With data augmentation, we are trying to make the training distribution look like the test distribution, and that would be off-topic.

Yes, your point makes sense. But what would be the best choice in practice?

I believe we have already augmented the dataset by using the high-quality images from the internet. But I assumed you were talking about blurring or otherwise distorting these high-quality images to make them look like the pictures that users would upload. That is definitely worth doing, but it is most effective once you have actually hit a data mismatch problem and identified which "mismatches" are causing it.
To do this, the first step is to build a model as fast as you can (by training on the high-quality images) and perform error analysis. Once you identify a data mismatch problem, [Week 2 Video 3](https://www.coursera.org/learn/machine-learning-projects/lecture/biLiy/addressing-data-mismatch) discusses how to address it. Prof. Ng proposes a systematic, manual error analysis procedure to identify what kind of data the model is getting wrong. Then you can augment the training data with more confidence.
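
To make the "blurring or distorting" idea concrete, here is a minimal sketch of one possible degradation step. It is only an illustration, not something taken from the course: the function names `box_blur` and `degrade`, the kernel size, and the noise level are all assumptions, and in practice you would choose the distortions based on what your error analysis shows.

```python
import numpy as np


def box_blur(image, k=3):
    """Blur an H x W x C image with a k x k mean filter (k should be odd)."""
    pad = k // 2
    # Repeat edge pixels so the output keeps the input's height and width.
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    h, w = image.shape[:2]
    out = np.zeros(image.shape, dtype=np.float64)
    # Sum the k * k shifted copies of the image, then average them.
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w, :]
    return out / (k * k)


def degrade(image, k=5, noise_std=0.05, seed=0):
    """Hypothetical augmentation: blur, then add Gaussian noise.

    Assumes pixel values are floats in [0, 1]. The defaults are
    placeholders, not tuned values.
    """
    rng = np.random.default_rng(seed)
    blurred = box_blur(np.asarray(image, dtype=np.float64), k)
    noisy = blurred + rng.normal(0.0, noise_std, size=blurred.shape)
    return np.clip(noisy, 0.0, 1.0)


# Example: degrade a synthetic stand-in for a high-quality image.
sharp = np.random.default_rng(1).random((32, 32, 3))
low_quality = degrade(sharp, k=5, noise_std=0.05)
```

You would apply something like `degrade` to a copy of the high-quality training images and keep the originals as well, then check on the dev set whether the mismatch error actually goes down.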
