Can removing random training examples in classification problems lead to a better generalised fit?

Gopal_Ram · July 20, 2022, 9:19pm

Why won’t removing a random set of training examples avoid overfitting in Classification Problems?

Because of the random training examples, the algorithm goes out of its way to bend the curve around them. Won't future examples that land near those random examples, but are not random themselves (i.e. they have the opposite label), then fall on the wrong side of the boundary and be misclassified?

Please find attached 2 images: 1) the example of overfitting from the lecture, and 2) my suggested solution, which deletes the random examples (shown in black) and gives a curve that seems more accurate.

Please help me out. Cheers!

WhatsApp Image 2022-07-21 at 2.46.09 AM

rmwkwok · July 21, 2022, 1:58am

Hello @Gopal_Ram, interesting. I think the critical question is how we would remove those samples if our samples had 10 features. Inspection by eye is unlikely to work. Any ideas?

Also, if we are selecting which data points survive, can we make sure that the model trained this way is not biased by our choices or preferences? We certainly want a model that generalizes to real-world data.

Raymond

Elemento · July 21, 2022, 5:48am

Hey @Gopal_Ram,
Welcome to the community. That’s an interesting question indeed!

And I agree with Raymond on this. In my opinion, what you have removed are not really random training examples but (sort of) the outliers of each class, so that the algorithm, even when over-fitting, doesn’t take these extreme samples into account. Theoretically, you could detect them with Anomaly Detection algorithms (discussed in Course 3 of MLS), but that brings its own costs: extra computation, extra training time, extra complexity, etc. I guess overcoming over-fitting with other techniques such as regularization, dropout, ensembling, etc. might be more attractive.

Another aspect of this discussion: what if these points are naturally occurring outliers, i.e., similar points exist in the dev/test sets as well? In that case, would you want to remove them from your training set in the first place? Perhaps an algorithm could learn to classify these points correctly without over-fitting, for instance after adding a new feature that distinguishes them, or after removing a highly correlated feature, and so on.
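To make the regularization alternative a bit more concrete, here is a minimal sketch in plain Python. It is only an illustration under assumptions: the 1-D data set, the stray point at `x = 2.5`, the values of `lam`, and the helper `fit_logistic` are all made up for this example and do not come from the lecture. It keeps the stray point in the training set and compares the fitted weight with and without an L2 penalty.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lam, lr=0.1, steps=5000):
    """Gradient descent for one-feature logistic regression with an L2 penalty.

    The cost is the mean log loss plus (lam / (2 * m)) * w**2.
    The bias b is not penalized.
    """
    m = len(xs)
    w, b = 0.0, 0.0
    for _ in range(steps):
        errs = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errs, xs)) / m + (lam / m) * w
        grad_b = sum(errs) / m
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Made-up 1-D data: class 0 on the left, class 1 on the right,
# plus one stray class-0 point at x = 2.5 sitting among the class-1 points.
xs = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0, 2.5]
ys = [0, 0, 0, 1, 1, 1, 0]

w_plain, b_plain = fit_logistic(xs, ys, lam=0.0)  # no regularization
w_reg, b_reg = fit_logistic(xs, ys, lam=5.0)      # hypothetical penalty strength

print("lam = 0: w =", round(w_plain, 3), "b =", round(b_plain, 3))
print("lam = 5: w =", round(w_reg, 3), "b =", round(b_reg, 3))
```

The penalized fit ends up with a smaller weight, so the model is discouraged from bending sharply around any single point, and nobody had to decide by hand which points to delete. With high-degree polynomial features the same penalty is what keeps the decision boundary from wiggling around isolated examples.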

Let me know if this helps.

Cheers,
Elemento

Gopal_Ram · July 21, 2022, 8:21am

Hey, thanks for the reply. Can you please elaborate on the second paragraph of your answer, i.e. the part about bias? I did not understand it completely.

Gopal_Ram · July 21, 2022, 8:23am

Hey, thanks for the reply. I have pretty much understood your answer.
Really nice interacting with you. Cheers!

rmwkwok · July 21, 2022, 8:42am

For example, you chose to remove the 2 blue circles. Why wouldn’t you remove the 4 red crosses instead? And if there were not just 2 blue circles being surrounded but a total of 4, how would you decide which color to remove?

So, your choice can change the boundary.
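A rough sketch of this point, under assumptions: the coordinates below are invented, and a 1-nearest-neighbour rule stands in for the lecture's classifier simply because it needs no training loop. The helper `nearest_label` and all variable names are hypothetical.

```python
def nearest_label(train, query):
    """Return the label of the training point closest to `query` (1-nearest-neighbour)."""
    def sq_dist(p):
        (x, y), _ = p
        return (x - query[0]) ** 2 + (y - query[1]) ** 2
    return min(train, key=sq_dist)[1]

# Points that clearly belong to one class or the other (made-up coordinates).
clear = [
    ((1.0, 1.0), "blue"), ((1.5, 2.0), "blue"), ((2.0, 1.0), "blue"),
    ((6.0, 6.0), "red"), ((7.0, 5.0), "red"), ((6.5, 7.0), "red"),
]

# The contested region: 2 blue circles sitting among 4 red crosses.
contested_blue = [((4.0, 4.0), "blue"), ((4.2, 3.8), "blue")]
contested_red = [
    ((3.5, 4.5), "red"), ((4.5, 4.5), "red"),
    ((3.5, 3.5), "red"), ((4.7, 3.4), "red"),
]

query = (4.1, 4.0)  # a future example that lands in the contested region

# Choice A: delete the 2 blue circles and keep the red crosses.
label_a = nearest_label(clear + contested_red, query)
# Choice B: delete the 4 red crosses and keep the blue circles.
label_b = nearest_label(clear + contested_blue, query)

print("after removing the blue circles:", label_a)
print("after removing the red crosses: ", label_b)
```

The same query point gets a different label depending only on which group was deleted by hand, which is the bias being described here.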

Raymond

Gopal_Ram · July 21, 2022, 9:08am

Ahh, beautifully explained. Now I understand completely. Thank you for the reply. Great interacting with you. Cheers!
