How to handle outliers?

Hi there,

In multiple linear regression, I want to know how to handle outliers in both the independent and dependent (target) variables; this wasn't explained. Like the independent variables, the dependent variable will also have extreme values, so how do we overcome the issue in such cases? For example, suppose an independent or dependent (target) variable has a range of 1-100, where 95% of the values lie between 1 and 40 and the remaining 5% lie above 40.
Need help!

Hi there, good question!
We definitely need to consider the impact of outliers before interpreting the regression results, because they have the potential to bias those results.
In my view, a good starting point would be to plot x and y to visually check what the data look like and decide whether we want to remove the outliers or whether it's actually okay to keep them (for example, when the sample size is big enough and the effect of the outliers is negligible). Here are some readings on this topic: identifying outliers and why we should care about them.
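
To make that concrete, here is a minimal sketch of such a visual check, using made-up data in place of your own (the numbers and variable names are just placeholders):

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for your data: most target values are moderate, a few are extreme
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3 * x + rng.normal(0, 2, 100)
y[:5] += 40  # a handful of extreme target values, like the 5% above 40 in the question

plt.scatter(x, y, alpha=0.6)
plt.xlabel("x (feature)")
plt.ylabel("y (target)")
plt.title("Scatter plot to eyeball potential outliers")
plt.show()
```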


Hi @Praveen_Titus_F
There are many ways you can deal with outliers. For example, if your data suffers from skew, you can apply a log() or ln() transform so the distribution becomes more normal, which reduces the effect of the outliers.
Or you can use normalization techniques, like subtracting the mean from your data and dividing by the standard deviation.
Or, if your data has only a small number of outliers, you can remove them and replace them with the mean, or you can leave them in so the model gets accustomed to the presence of outliers, which can also occur when you apply the model in real life (when you deploy the model).
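
A rough sketch of these transformations, assuming a pandas DataFrame with a hypothetical skewed column named "income" (the column name, the z-score threshold, and the data are my own illustrative choices):

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed feature with one extreme value
df = pd.DataFrame({"income": [20, 22, 25, 28, 30, 31, 35, 38, 40, 400]})

# Log transform to reduce right skew (log1p keeps zero values defined)
df["income_log"] = np.log1p(df["income"])

# Standardization: subtract the mean and divide by the standard deviation
df["income_std"] = (df["income"] - df["income"].mean()) / df["income"].std()

# Replace extreme values (flagged here by a simple z-score rule) with the mean
z = (df["income"] - df["income"].mean()) / df["income"].std()
df["income_clean"] = df["income"].mask(z.abs() > 2, df["income"].mean())

print(df)
```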

Another thing: I hope you will take a look at statistics, which will be of great help to you in dealing with anomalies or outliers and many other things.

Please feel free to ask any questions,
Thanks,
Abdelrahman


I think that what you mention is the problem of Class Imbalance.

Class imbalance is a phenomenon that is very common in real-world applications. Class imbalance is easier to handle in binary classification, and rather complex in multiclass classification.

There are several ways to handle class imbalance: they can be based on the choice of metrics, on working with the data to reduce the imbalance, or on working with the algorithm so that it learns to manage the imbalance.

Each of these approaches in turn offers several techniques. For example, if you decide to approach the problem via the data, you could use resampling, which can be undersampling or oversampling.
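
As an illustration, here is a small sketch of random oversampling using only pandas (the DataFrame and column names are hypothetical; dedicated libraries such as imbalanced-learn provide more complete tooling):

```python
import pandas as pd

# Hypothetical imbalanced binary-classification data with a "label" column
df = pd.DataFrame({"feature": range(12),
                   "label":   [0] * 10 + [1] * 2})

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Random oversampling: sample the minority class with replacement
# until it matches the size of the majority class, then shuffle
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=0)
df_balanced = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=0)

print(df_balanced["label"].value_counts())
```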

There’s a lot written around this topic, so to dig deeper into it, just google Data Imbalance and you’ll find a lot of literature.

Hope this is helpful.

Juan

If you’re doing linear regression and you have a good data set, there really aren’t any outliers.

Or more properly, the outliers can be significantly important, and you may not want to ignore them or minimize them.

Thank you all for your responses!
@AbdElRhaman_Fakhry, @kchong37, @TMosh, @Juan_Olano
I've gone through all of your responses and found them helpful.
Now let me show you the scenarios that I'm talking about:

  1. When the feature variables have outliers:
    image

  2. When the target variable has outliers

How to handle outliers in the above two scenarios?

Then, after googling, I found two ways to handle these outliers:
i) Imputation with mean / median / mode

ii) Capping
The data points that are lower than the 10th percentile are replaced with the 10th percentile value, and the data points that are greater than the 90th percentile are replaced with the 90th percentile value.
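
Here is a minimal sketch of what capping at the 10th/90th percentiles could look like, together with a median-imputation variant, on a made-up column named "sales" (the data and names are just placeholders):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"sales": [1, 3, 5, 7, 9, 11, 13, 15, 40, 95]})  # hypothetical data

# Capping: clip values below the 10th percentile and above the 90th percentile
low, high = np.percentile(df["sales"], [10, 90])
df["sales_capped"] = df["sales"].clip(lower=low, upper=high)

# Imputation alternative: mark extreme values as missing, then fill with the median
is_extreme = (df["sales"] < low) | (df["sales"] > high)
df["sales_imputed"] = df["sales"].mask(is_extreme).fillna(df["sales"].median())

print(df)
```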

What is your opinion on imputation and capping?

Thank you…

Hi there, regarding methods for handling the outliers, I think as long as we are aware of the impact of each method, we are fine. For both scenarios, I would compare the regression results with and without the outliers to see what impact the outliers have.
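
For example, a small sketch of that comparison on synthetic data (here the extreme points are injected and then dropped by index just to show the effect; in practice you would flag them with a rule such as a z-score cut-off or with domain knowledge):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data with a few extreme target values mixed in
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 1, size=100)
y[:5] += 40  # the injected "outliers"

def fit_and_score(X, y):
    model = LinearRegression().fit(X, y)
    return model.coef_[0], mean_squared_error(y, model.predict(X))

coef_all, mse_all = fit_and_score(X, y)  # fit with the outliers included

mask = np.ones(len(y), dtype=bool)
mask[:5] = False  # drop the known extreme points
coef_clean, mse_clean = fit_and_score(X[mask], y[mask])  # fit without them

print(f"with outliers:    slope={coef_all:.2f}, MSE={mse_all:.2f}")
print(f"without outliers: slope={coef_clean:.2f}, MSE={mse_clean:.2f}")
```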

Hi @kchong37, without the outliers the MSE will be low, whereas with the outliers the MSE will be relatively large. That is how the impact will show up… anyhow, thanks for your words…

To think more about this case, could you please share more information on your use case? What is your goal, and what is your data? Knowing this basic information will guide us in helping you think through your challenge.

Hi @Juan_Olano Generally, in multiple linear regression I face situations like the ones I showed in the images, so I need a concrete approach to handle outliers in both the dependent and independent variables.
I recently found the two approaches that I mentioned above, i.e. imputation and capping.
What do you think of these two methods?

Hi
From a first look, I think the columns (features) Asia and Americans suffer from skew. To confirm whether they really are skewed or something else, please plot the distribution for each column (feature), like in this photo, which shows the distribution of the data and what it suffers from:
image

After that, if the data suffers from skew, you can do what I said before, like taking log() of the columns or normalizing them.

A skewed distribution looks like this:
image
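
A quick sketch of how those per-column distribution plots could be produced (the DataFrame and column names below are placeholders, not your actual data):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in DataFrame; replace with your own data (e.g. the Asia / Americas columns)
rng = np.random.default_rng(0)
df = pd.DataFrame({"Asia": rng.lognormal(mean=3, sigma=0.8, size=500),   # right-skewed
                   "Americas": rng.normal(loc=50, scale=10, size=500)})  # roughly symmetric

# One histogram per column to check the shape of each distribution
df.hist(bins=30, figsize=(10, 4))
plt.tight_layout()
plt.show()

# Skewness well above 0 suggests a right-skewed column (a log transform may help)
print(df.skew())
```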


Thank you @AbdElRhaman_Fakhry
Understood !!
Can I use log() for independent variables if they are skewed?

Of course you can use it.


I think that the method is very much linked to the use case. For example, if my dataset is lung X-rays, I would probably not discard “outliers”. Or if my goal is to determine something for young people, and my dataset contained some people aged above my desired range, then I would probably cut out those cases. So I would say there is no one solution for all cases. I still think it depends on the goal of the model.


You might also try adding some polynomial features, so that you’re not trying to fit a straight line to that data set.
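
For example, a sketch with scikit-learn (the degree and the synthetic data are just illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a clearly non-linear relationship
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 0.5 * X[:, 0] ** 2 + rng.normal(0, 1, size=100)

# Degree-2 polynomial features let plain linear regression fit the curvature
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                      LinearRegression())
model.fit(X, y)
print(model.score(X, y))  # R^2 on the training data
```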

I don’t see that you have defined what you mean by “outliers”, as opposed to maybe you just have an incomplete data set that doesn’t cover all of the range of possible ‘x’ values.

Absent some knowledge of the characteristics of the data set, there is a substantial risk that you could massage the data so it meets your expectations, rather than reflects the reality of your data.
