Course 3 Week 2: Mismatched data

In the lecture on addressing data mismatch, we talked about synthesizing data from 10,000 hours of normal (quiet-background) voice recordings plus 1 hour of car noise.
Does this mean that, because only 1 hour of the data is actual car noise and the rest is normal voice, the NN overfits? Or is it something else?
Sir also said we can repeat the car noise 10,000 times so that it becomes 10,000 hours of car noise,
so in the end we have 10,000 hours of in-car noise. How will the model overfit if all the data is similar?

Hello @Termu,

Let’s take another look at what Andrew said:

one thing you could try is take this one hour of car noise and repeat it 10,000 times in order to add to this 10,000 hours of data recorded against a quiet background. If you do that, the audio will sound perfectly fine to the human ear, but there is a chance, there is a risk that your learning algorithm will over fit to the one hour of car noise.

He said “there is a chance, there is a risk” that it can overfit. So using only 1 hour of car noise does not guarantee overfitting; it may or may not happen. Andrew further explained why it may happen with the following illustration:


I think he meant that the 1 hour of car noise we collected might turn out not to be representative at all. For example, if person A recorded one hour of car noise while driving alone, that hour CANNOT represent the car noise when there are passengers, such as a family with 2 kids.

If we use person A’s one hour of car noise, our model is at risk of overfitting to the kind of noise that person A collected, and of failing to recognize speech when the noise of a family is present.
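
To make the point concrete, here is a toy sketch in Python (my own illustration, not code from the course). It assumes NumPy, uses random arrays as stand-ins for real audio, and scales the numbers down to a 100-sample noise clip repeated 100 times. The helper names tile_noise, mix, and count_distinct_chunks are hypothetical. It shows that repeating a short clip gives you a long noise track, but one that still contains only a single distinct pattern:

```python
import numpy as np


def tile_noise(noise, target_len):
    """Repeat a short noise clip end to end until it covers target_len samples."""
    reps = -(-target_len // len(noise))  # ceiling division
    return np.tile(noise, reps)[:target_len]


def mix(speech, noise, noise_gain=0.5):
    """Add noise to clean speech sample by sample (equal-length 1-D arrays)."""
    return speech + noise_gain * noise


def count_distinct_chunks(track, chunk_len):
    """Count how many different chunk_len-sized segments the track contains."""
    chunks = {track[i:i + chunk_len].tobytes()
              for i in range(0, len(track), chunk_len)}
    return len(chunks)


rng = np.random.default_rng(0)

# Toy stand-ins (assumptions): random numbers instead of real recordings.
speech = rng.normal(size=10_000)    # plays the role of the clean speech
car_noise = rng.normal(size=100)    # plays the role of the 1 hour of car noise

# Option 1: repeat the same short clip to cover all of the speech.
tiled = tile_noise(car_noise, len(speech))
synthetic = mix(speech, tiled)

# Option 2: a noise track as long as the speech, with no repetition.
varied = rng.normal(size=len(speech))

tiled_patterns = count_distinct_chunks(tiled, len(car_noise))
varied_patterns = count_distinct_chunks(varied, len(car_noise))
print(tiled_patterns, varied_patterns)
```

The synthesized audio in option 1 is as long as the clean speech, yet every example carries the same noise pattern. That is the kind of narrow subset of car noise a model could end up fitting too closely.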


Thank you for the response. I had the same intuition.