Why won’t removing a random set of training examples avoid overfitting in classification problems?
Isn’t the issue that the algorithm goes out of its way to fit those random training examples into the curve, so that future examples located near them, which might not be random themselves (they may have the opposite label), end up in the wrong classification?
PFA 2 images: 1) the example of overfitting from the lecture, and 2) my suggested solution, which involves deleting the random examples (shown in black) and getting a curve that seems more accurate, maybe.
Hello @Gopal_Ram, interesting. I think it is critical to ask how we would remove those samples if they had 10 features, since inspection by eye is unlikely to work there. Any ideas?
Also, if we are selecting which data points survive, can we be sure that the resulting model is not biased by our choices or preferences? We certainly want a model that generalizes to real-world data.
Hey @Gopal_Ram,
Welcome to the community. That’s an interesting question indeed!
And I agree with Raymond on this. In my opinion, instead of removing random training examples, you have essentially removed the outliers for each of the classes (sort of), so that even when over-fitting, the algorithm doesn’t take these extreme samples into account. Now, theoretically, you could detect these with anomaly-detection algorithms (discussed in Course 3 of MLS), but that would lead to a plethora of other undesirable factors: extra computation, extra training time, extra complexity, etc. I guess trying to overcome over-fitting with other techniques, such as regularization, dropout, or ensembling, might be more attractive.
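To make the regularization suggestion concrete, here is a minimal NumPy sketch (with made-up toy data and a hand-rolled gradient-descent logistic regression, not code from the course) showing that an L2 penalty shrinks the weights of a flexible polynomial model instead of deleting any samples:

```python
import numpy as np

# Hypothetical 2-D toy data: two Gaussian blobs, one per class.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.5, (20, 2)), rng.normal(1, 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

def poly_features(X):
    """Add higher-degree terms so the model has enough capacity to overfit."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1**2, x2**2, x1 * x2, x1**3, x2**3])

def train_logreg(X, y, lam, lr=0.1, steps=2000):
    """Gradient descent on L2-regularized logistic loss; lam is the penalty strength."""
    Xf = poly_features(X)
    w = np.zeros(Xf.shape[1])
    b = 0.0
    m = len(y)
    for _ in range(steps):
        z = np.clip(Xf @ w + b, -30, 30)      # avoid overflow in exp
        p = 1 / (1 + np.exp(-z))              # sigmoid predictions
        grad_w = Xf.T @ (p - y) / m + lam * w / m  # data term + regularization term
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

w_none, b_none = train_logreg(X, y, lam=0.0)   # unregularized
w_reg, b_reg = train_logreg(X, y, lam=10.0)    # strongly regularized
print(np.linalg.norm(w_none), np.linalg.norm(w_reg))
```

With `lam=0` the polynomial weights grow freely to fit every point, while `lam=10` pulls them toward zero, which flattens the decision boundary in the same spirit as the lecture’s regularized example, and no training samples have to be thrown away.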
Another aspect of this discussion is what if these points are naturally occurring outliers, i.e., some such points exist in the dev/test sets as well. In that case, would you want to remove these outliers from your training set in the first place? Perhaps an algorithm could learn to classify these points correctly without over-fitting, for instance after adding a new feature that distinguishes them, or after removing a highly correlated feature, and so on.
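As a toy illustration of the “add a new feature” idea (all numbers made up): two class-0 points that look like surrounded outliers along one feature can become perfectly separable once a second, distinguishing feature is added.

```python
import numpy as np

# Made-up data: along feature x1, two class-0 points (x1 = 2.1, 2.3) sit
# inside the class-1 cluster, like the surrounded blue circles in the plot.
x1 = np.array([0.0, 0.2, 0.4, 2.0, 2.2, 2.1, 2.3, 2.4])
y  = np.array([0,   0,   0,   1,   1,   0,   0,   1])

# No single threshold on x1 alone classifies every point correctly.
best_1d = max(((x1 > t).astype(int) == y).mean() for t in np.linspace(-1, 3, 200))

# A hypothetical second feature x2 (some measurement we had not used yet)
# distinguishes the embedded points, so a simple threshold now suffices.
x2 = np.array([0.1, 0.0, 0.2, 0.9, 1.1, 0.1, 0.2, 1.0])
acc_2d = ((x2 > 0.5).astype(int) == y).mean()
print(best_1d, acc_2d)
```

The point is that the “outliers” were only outliers in the feature space we happened to be looking at; with the right extra feature, the model can classify them correctly without either over-fitting or discarding them.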
For example, you chose to remove the 2 blue circles; why wouldn’t you choose to remove the 4 red crosses instead? And if there were not just 2 blue circles being surrounded but a total of 4, how would you decide which color to remove?