I recently started working on a machine learning model for predicting insurance premiums given features such as age, gender, number of children, etc. I am thinking of using a linear regression model for this. While performing the data cleaning step, I saw that my target variable has many outliers, and I am not sure how to handle this type of situation.
Kindly help me in this regard.
Hi there
The question is whether the target variable is labelled correctly. Did you check its plausibility?
Quoting from this thread: https://community.deeplearning.ai/t/disadvantagesoffeaturescaling/384747/2:
Often here, having a clear strategy how to deal with outliers […] (e.g. winsorizing or clipping if the business problem allows it) is useful.
This also applies to your labels / targets.
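For readers unfamiliar with the term, winsorizing can be sketched as clipping values to chosen percentiles. This is only an illustration: the helper name `winsorize`, the 10th/90th percentile limits, and the toy data are assumptions, and whether clipping is acceptable at all depends on the business problem.

```python
import numpy as np

def winsorize(values, lower_pct=5.0, upper_pct=95.0):
    """Clip values to the given lower/upper percentiles (hypothetical helper)."""
    lo, hi = np.percentile(values, [lower_pct, upper_pct])
    return np.clip(values, lo, hi)

# Toy data (assumption): one extreme value at the end.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0])
x_w = winsorize(x, lower_pct=10.0, upper_pct=90.0)

print(x_w)  # the extreme value is pulled in to the 90th percentile
```

Note that the number of samples stays the same; only the extreme values are replaced by the percentile limits.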
Best regards
Christian
Hi,
I think the target variable is more or less labelled correctly. So one solution here is to try winsorizing the target variable, if I understand correctly?
Thanks,
Nimish
Nope, @Nimish_Khandelwal.
If the targets are labelled correctly, we should not manipulate them by clipping or winsorizing them. But of course you can consider transforming all labels (e.g. with log scaling), training the model to predict the transformed label, and then reverting the transformation to get back to your actual label.
I understand that the labels show scatter which the model cannot explain, but then that is the reality.
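A minimal sketch of the label transformation described above, assuming scikit-learn is used. The synthetic data, the choice of `log1p`/`expm1`, and the variable names are assumptions for illustration, not the actual insurance data set.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): a right-skewed, strictly positive target.
rng = np.random.default_rng(0)
X = rng.uniform(18, 65, size=(500, 1))  # e.g. age
y = np.exp(7.0 + 0.03 * X[:, 0] + rng.normal(0, 0.4, size=500))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The regressor is fitted on log1p(y); predictions are mapped back with expm1.
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)  # already on the original target scale
print("R^2 on the original scale:", model.score(X_test, y_test))
```

No label is altered or removed here: the transformation is applied to all labels and reverted at prediction time.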
Can you show what the model residuals look like?
(See also this thread: True vs predicted values biasedintercept  #4 by Christian_Simonis)
In general, points to consider are:

- Is the scatter of your label caused by the fact that the labels emerge from different distributions (e.g. let’s assume the insurance premium is systematically different for certain characteristics)? In that case you could consider training a separate model for each group (e.g. one insurance premium model for car owners who just got their driver's license and one for more experienced drivers).

- Try to incorporate domain knowledge into your features with feature engineering; see also this thread: Time Series Linear Regression

- Depending on your residual analysis, consider models other than a linear model (which will only perform well if you manage to capture all the nonlinearity in your features). Gaussian processes might be worth a look, since they also allow you to account for uncertainty / confidence and to model nonlinearity; see also this thread: Deep learning is a small part of ai  #6 by Christian_Simonis
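As a rough illustration of the last point, here is a sketch of a Gaussian process regressor in scikit-learn that returns a standard deviation alongside each prediction. The toy sine data and the kernel choice (RBF plus a white-noise term) are assumptions and would need to be adapted to the real features.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Toy nonlinear data (assumption): a noisy sine curve over one feature.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, size=(80, 1)), axis=0)
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, size=80)

# RBF models the smooth nonlinear trend, WhiteKernel the observation noise.
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=0)
gp.fit(X, y)

# return_std=True yields a per-point standard deviation of the prediction.
X_new = np.linspace(0, 10, 5).reshape(-1, 1)
mean, std = gp.predict(X_new, return_std=True)
for x, m, s in zip(X_new[:, 0], mean, std):
    print(f"x={x:5.2f}  prediction={m:6.3f}  +/- {2 * s:.3f}")
```

Gaussian processes scale poorly with the number of samples, so this is mainly an option for small to medium data sets.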
Hope that helps, @Nimish_Khandelwal!
Best
Christian
In addition, here you can find a nice example of the value added by transforming the targets before fitting the model, including code examples:
Best regards
Christian
The outliers in your data set may be the most important information of all. You have no way of knowing. Better to include them as-is than to alter them.
Thanks for the inputs @Christian_Simonis and @TMosh. I created the linear model and got an average accuracy of around 75% using cross-validation. May I share the code with you, so that you can have a look at the procedure and guide me on further steps?
Thanks,
Nimish
Could you please show what your residual plots look like for an example train/test split?
Before sharing the code two points:
- I am convinced the discussion is most valuable, also for (future) fellow learners, if we keep discussing questions openly, e.g. your question in this open thread, not in private messages
- If you have the rights to that code and you do not violate the code of conduct, you can share the relevant piece of code if that helps to drive the discussion forward
Hope that helps!
Best regards
Christian
I am confused. Normally a linear model does not give an accuracy metric. If you’re doing classification, then a linear model is not appropriate.
How are you measuring “accuracy”?
Hi @TMosh,
Here, accuracy refers to the R-squared value (I used model.score(X_test, y_test)), i.e., 75% of the variance is explained by the model.
Thanks,
Nimish
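For reference, a small sketch showing that `score` on a scikit-learn regressor returns R-squared, compared against a manual computation and a cross-validated version. The synthetic four-feature data set is an assumption standing in for the insurance data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (assumption): 4 features, linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(0, 2.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# For regressors, score() returns R^2 = 1 - SS_res / SS_tot.
r2_from_score = model.score(X_test, y_test)

residuals = y_test - model.predict(X_test)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# Cross-validated R^2 over 5 folds.
cv_r2 = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")

print(r2_from_score, r2_manual, cv_r2.mean())
```

Because this is a share of explained variance rather than a fraction of correct predictions, "R-squared" is the less ambiguous name for it.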
Here are the residual plots for two features: (i) age and (ii) bmi.
Thanks,
Nimish
Thanks @Nimish_Khandelwal.
In addition to Tom’s feedback:
- I would recommend labelling each axis so that each plot can easily be understood
- Assuming you plotted the residuals on the y axis and the feature on the x axis: your residuals show a clear pattern. This means that, e.g. with feature engineering, you can incorporate that systematic information into your model. The goal should be that the residuals no longer show any systematic dependency on your features afterwards; ideally you just see random noise, something like this: https://global.discoursecdn.com/dlai/original/3X/c/0/c0b9819ef070619a63cffc64b637c1dd598074e8.jpeg
see also this thread: True vs predicted values biasedintercept  #4 by Christian_Simonis
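A sketch of a residual plot with labelled axes, assuming matplotlib and scikit-learn. The data are synthetic (an assumption): the target depends quadratically on bmi, so a purely linear fit leaves a visible pattern in the residuals, similar in spirit to what is described above.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumption): quadratic dependency on bmi plus noise.
rng = np.random.default_rng(0)
bmi = rng.uniform(16, 45, size=300)
y = 50.0 * (bmi - 30.0) ** 2 + rng.normal(0, 500, size=300)

X = bmi.reshape(-1, 1)
model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)  # actual minus predicted

fig, ax = plt.subplots()
ax.scatter(bmi, residuals, s=10)
ax.axhline(0.0, color="black", linewidth=1)  # reference line at zero residual
ax.set_xlabel("bmi (feature)")
ax.set_ylabel("residual = actual - predicted")
ax.set_title("Residuals vs. bmi")
# Display with plt.show() or write to disk with fig.savefig(...) as needed.
```

Here the residuals form a U shape around the zero line; adding a feature such as `bmi ** 2` would be one way to absorb that systematic part into the model.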
Best regards
Christian
Also:
This can only be answered by someone with domain knowledge, but after checking the plots, I would like to emphasise that question.
By the way: how many features did you use in total, @Nimish_Khandelwal?
Best regards
Christian
I don’t know, but might the discontinuities be due to the outliers in the target variable?
Thanks,
Nimish
For the variable ‘age’, for example, we can see that the residuals are more or less randomly distributed. But for the variable ‘bmi’ there is some kind of pattern.
In total I have used 4 variables.
Thanks,
Nimish
What sort of “cleaning” are you doing?
Removing outliers from features (using the IQR method), checking the unique values of variables for unexpected entries, removing variables that do not affect the response, etc.
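The IQR method mentioned here could look roughly like the sketch below. The actual cleaning code was not shared in the thread, so the helper name `iqr_mask`, the conventional factor of 1.5, and the toy values are all assumptions.

```python
import numpy as np

def iqr_mask(values, k=1.5):
    """Return True for values inside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values >= q1 - k * iqr) & (values <= q3 + k * iqr)

# Toy data (assumption): one implausibly high bmi value at the end.
bmi = np.array([22.0, 24.5, 27.1, 30.2, 26.3, 28.8, 61.0])
premium = np.array([3200.0, 4100.0, 5300.0, 8900.0, 4700.0, 6100.0, 9500.0])

keep = iqr_mask(bmi)

# Drop the same rows from the feature and the target so they stay aligned.
bmi_clean = bmi[keep]
premium_clean = premium[keep]

print(keep)
```

The mask is computed on the feature only; the target values themselves are not modified, only the rows flagged by the feature are removed.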