How to handle outliers?

Hi there,

In multiple linear regression, I want to know how to handle outliers in both the independent and dependent (target) variables; this wasn't explained. Like the independent variables, the dependent variable will also have extreme values, so how do we overcome the issue in such cases? For example, suppose an independent or dependent (target) variable has a range of 1-100, where 95% of the values lie between 1 and 40 and the remaining 5% lie above 40.
Need help!

Hi there, good question!
We definitely need to consider the impact of outliers before interpreting the regression results, because they have the potential to bias those results.
In my view, a good starting point would be to plot x and y to visually check what the data look like and decide whether we want to remove the outliers or whether it's actually okay to keep them (for example, when the sample size is big enough and the effect of the outliers is negligible). Here are some readings on this topic: identifying outliers and why we should care about them.
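
To make that concrete, here is a minimal sketch of such a visual check, using made-up data in place of your own (the numbers and variable names are just placeholders):

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for your data: most target values are moderate, a few are extreme
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3 * x + rng.normal(0, 2, 100)
y[:5] += 40  # a handful of extreme target values, like the 5% above 40 in the question

plt.scatter(x, y, alpha=0.6)
plt.xlabel("x (feature)")
plt.ylabel("y (target)")
plt.title("Scatter plot to eyeball potential outliers")
plt.show()
```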


Hi @Praveen_Titus_F
There are many ways you can deal with outliers. For example, if your data suffers from skew, you can apply a log() or ln() transform so the distribution becomes more normal, which reduces the effect of the outliers.
Or you can use normalization techniques, like subtracting the mean from your data and dividing by the standard deviation.
Or, if your data has only a small number of outliers, you can remove them and replace them with the mean, or you can leave them in so the model gets accustomed to the presence of outliers, which can also occur when you apply the model in real life (when you deploy the model).
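
A rough sketch of these transformations, assuming a pandas DataFrame with a hypothetical skewed column named "income" (the column name, the z-score threshold, and the data are my own illustrative choices):

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed feature with one extreme value
df = pd.DataFrame({"income": [20, 22, 25, 28, 30, 31, 35, 38, 40, 400]})

# Log transform to reduce right skew (log1p keeps zero values defined)
df["income_log"] = np.log1p(df["income"])

# Standardization: subtract the mean and divide by the standard deviation
df["income_std"] = (df["income"] - df["income"].mean()) / df["income"].std()

# Replace extreme values (flagged here by a simple z-score rule) with the mean
z = (df["income"] - df["income"].mean()) / df["income"].std()
df["income_clean"] = df["income"].mask(z.abs() > 2, df["income"].mean())

print(df)
```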

Another thing: I hope you will take a look at statistics, which will be of great help to you in dealing with anomalies or outliers and many other things.

Please feel free to ask any questions,
Thanks,
Abdelrahman


I think that what you mention is the problem of Class Imbalance.

Class imbalance is a phenomenon that is very common in real-world applications. Class imbalance is easier to handle in binary classification, and rather complex in multiclass classification.

There are several ways to handle class imbalance: they can be based on the choice of metrics, on working with the data to reduce the imbalance, or on working with the algorithm so that it learns to manage the imbalance.

Each of these approaches in turn offers several techniques. For example, if you decide to approach the problem via the data, you could use resampling, which can be undersampling or oversampling.
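
As an illustration, here is a small sketch of random oversampling using only pandas (the DataFrame and column names are hypothetical; dedicated libraries such as imbalanced-learn provide more complete tooling):

```python
import pandas as pd

# Hypothetical imbalanced binary-classification data with a "label" column
df = pd.DataFrame({"feature": range(12),
                   "label":   [0] * 10 + [1] * 2})

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Random oversampling: sample the minority class with replacement
# until it matches the size of the majority class, then shuffle
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=0)
df_balanced = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=0)

print(df_balanced["label"].value_counts())
```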

There’s a lot written around this topic, so to dig deeper into it, just google Data Imbalance and you’ll find a lot of literature.

Hope this is helpful.

Juan

If you’re doing linear regression and you have a good data set, there really aren’t any outliers.

Or more properly, the outliers can be significantly important, and you may not want to ignore them or minimize them.

Thank you all for your responses!
@AbdElRhaman_Fakhry, @kchong37, @TMosh, @Juan_Olano
I've gone through all of your responses and found them helpful.
Now let me show you the scenarios that I'm talking about:

  1. When the feature variables have outliers:
    image

  2. When the target variable has outliers

How to handle outliers in the above two scenarios?

Then, after googling, I found two ways to handle these outliers:
i) Imputation with mean / median / mode

ii) Capping
The data points that are lower than the 10th percentile are replaced with the 10th percentile value, and the data points that are greater than the 90th percentile are replaced with the 90th percentile value.
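
Here is a minimal sketch of what capping at the 10th/90th percentiles could look like, together with a median-imputation variant, on a made-up column named "sales" (the data and names are just placeholders):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"sales": [1, 3, 5, 7, 9, 11, 13, 15, 40, 95]})  # hypothetical data

# Capping: clip values below the 10th percentile and above the 90th percentile
low, high = np.percentile(df["sales"], [10, 90])
df["sales_capped"] = df["sales"].clip(lower=low, upper=high)

# Imputation alternative: mark extreme values as missing, then fill with the median
is_extreme = (df["sales"] < low) | (df["sales"] > high)
df["sales_imputed"] = df["sales"].mask(is_extreme).fillna(df["sales"].median())

print(df)
```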

What is your opinion on imputation and capping?

Thank you…

Hi there, regarding methods for handling the outliers, I think as long as we are aware of the impact of each method, we are fine. For both scenarios, I would compare the regression results with and without the outliers to see what impact the outliers have.
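
For example, a small sketch of that comparison on synthetic data (here the extreme points are injected and then dropped by index just to show the effect; in practice you would flag them with a rule such as a z-score cut-off or with domain knowledge):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data with a few extreme target values mixed in
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 1, size=100)
y[:5] += 40  # the injected "outliers"

def fit_and_score(X, y):
    model = LinearRegression().fit(X, y)
    return model.coef_[0], mean_squared_error(y, model.predict(X))

coef_all, mse_all = fit_and_score(X, y)  # fit with the outliers included

mask = np.ones(len(y), dtype=bool)
mask[:5] = False  # drop the known extreme points
coef_clean, mse_clean = fit_and_score(X[mask], y[mask])  # fit without them

print(f"with outliers:    slope={coef_all:.2f}, MSE={mse_all:.2f}")
print(f"without outliers: slope={coef_clean:.2f}, MSE={mse_clean:.2f}")
```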

Hi @kchong37, without the outliers the MSE will be low, whereas with the outliers the MSE will be relatively large. That is how the impact will show up… anyhow, thanks for your words…

To think more about this case, could you please share more information on your use case? What is your goal, and what is your data? Knowing this basic information will guide us in helping you think through your challenge.

Hi @Juan_Olano Generally, in multiple linear regression I face situations like the ones I showed in the images, so I need a concrete approach to handle outliers in both the dependent and independent variables.
I recently found the two approaches that I mentioned above, i.e. imputation and capping.
What do you think of these two methods?

Hi
From a first look, I think the columns (features) Asia and Americans suffer from skew. To confirm whether they really are skewed or something else, please plot the distribution for each column (feature), like in this photo, which shows the distribution of the data and what it suffers from:
image

After that, if the data suffers from skew, you can do what I said before, like taking log() of the columns or normalizing them.

A skewed distribution looks like this:
image
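
A quick sketch of how those per-column distribution plots could be produced (the DataFrame and column names below are placeholders, not your actual data):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in DataFrame; replace with your own data (e.g. the Asia / Americas columns)
rng = np.random.default_rng(0)
df = pd.DataFrame({"Asia": rng.lognormal(mean=3, sigma=0.8, size=500),   # right-skewed
                   "Americas": rng.normal(loc=50, scale=10, size=500)})  # roughly symmetric

# One histogram per column to check the shape of each distribution
df.hist(bins=30, figsize=(10, 4))
plt.tight_layout()
plt.show()

# Skewness well above 0 suggests a right-skewed column (a log transform may help)
print(df.skew())
```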


Thank you @AbdElRhaman_Fakhry
Understood !!
Can I use log() for independent variables if they are skewed?

Of course you can use it.


I think that the method is very much linked to the use case. For example, if my dataset is lung X-rays, I would probably not discard “outliers”. Or if my goal is to determine something for young people, and my dataset contained some people aged above my desired range, then I would probably cut out those cases. So I would say there is no one solution for all cases. I still think it depends on the goal of the model.


You might also try adding some polynomial features, so that you’re not trying to fit a straight line to that data set.
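
For example, a sketch with scikit-learn (the degree and the synthetic data are just illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a clearly non-linear relationship
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 0.5 * X[:, 0] ** 2 + rng.normal(0, 1, size=100)

# Degree-2 polynomial features let plain linear regression fit the curvature
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                      LinearRegression())
model.fit(X, y)
print(model.score(X, y))  # R^2 on the training data
```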

I don’t see that you have defined what you mean by “outliers”, as opposed to maybe you just have an incomplete data set that doesn’t cover all of the range of possible ‘x’ values.

Absent some knowledge of the characteristics of the data set, there is a substantial risk that you could massage the data so it meets your expectations, rather than reflects the reality of your data.
