Machine learning model

Nimish_Khandelwal · August 12, 2023, 10:47am

I recently started working on a machine learning model to predict insurance premiums from features such as age, gender, and number of children. I am thinking of using a linear regression model for this. While cleaning the data, I saw that my target variable has many outliers, and I am not sure how to handle this situation.
Kindly help me in this regard.

Christian_Simonis · August 12, 2023, 11:19am

Hi there

The question is whether the target variable is labelled correctly. Did you check its plausibility?

Quoting from this thread: https://community.deeplearning.ai/t/disadvantages-of-feature-scaling/384747/2:

Often here, having a clear strategy how to deal with outliers […] (e.g. winsorizing or clipping if the business problem allows it) is useful.

This also applies to your labels / targets.
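
As a rough illustration of what winsorizing (clipping to percentiles) means in practice, here is a minimal sketch using NumPy. The helper name `winsorize`, the 1st/99th percentile limits, and the synthetic `bmi` values are assumptions made for this example only; whether clipping is acceptable at all depends on the business problem.

```python
import numpy as np

def winsorize(values, lower_pct=1.0, upper_pct=99.0):
    """Clip values to the given lower/upper percentiles (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    low, high = np.percentile(values, [lower_pct, upper_pct])
    return np.clip(values, low, high)

rng = np.random.default_rng(0)
bmi = rng.normal(30.0, 5.0, size=1000)   # synthetic values, not real data
bmi[:3] = [95.0, 120.0, -10.0]           # inject a few implausible entries

bmi_w = winsorize(bmi)
print("before:", bmi.min(), bmi.max())
print("after: ", bmi_w.min(), bmi_w.max())
```

Note that the extreme rows are kept; only their values are pulled in to the percentile limits.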

Best regards
Christian

Nimish_Khandelwal · August 12, 2023, 1:21pm

Hi,

I think the target variable is more or less labelled correctly. So, if I understand correctly, one solution here is to try winsorizing the target variable?

Thanks,
Nimish

Christian_Simonis · August 12, 2023, 1:53pm

Nope, @Nimish_Khandelwal.
If the targets are labelled correctly, we should not manipulate them by clipping or winsorizing. But of course you can consider transforming all labels (e.g. with log scaling), training the model to predict the transformed label, and then reverting the transformation to get back to your actual label.
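
A minimal sketch of this transform, train, revert idea, assuming NumPy and a synthetic right-skewed target. The variable names (`age`, `premium`), the data-generating numbers, and the choice of `log1p`/`expm1` are illustrative assumptions, not taken from the actual data set.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
age = rng.uniform(18, 64, size=n)
# Synthetic, right-skewed target: multiplicative noise produces a long upper tail
premium = np.exp(7.0 + 0.03 * age + rng.normal(0.0, 0.5, size=n))

X = np.column_stack([np.ones(n), age])   # design matrix with intercept column

# 1) Transform the whole target, 2) fit a linear model on the transformed target
y_log = np.log1p(premium)
coef, *_ = np.linalg.lstsq(X, y_log, rcond=None)

# 3) Predict in log space, then revert the transformation
pred_log = X @ coef
pred_premium = np.expm1(pred_log)

print("fitted slope in log space:", coef[1])
print("first predictions:", pred_premium[:3].round(0))
```

If you use scikit-learn, `TransformedTargetRegressor` wraps the same pattern around a regressor. Keep in mind that after reverting a log transform, the prediction tends to sit closer to the conditional median than to the conditional mean.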

I understand that the labels show scatter that the model cannot explain, but then that is the reality.

Can you show what the model residuals look like?
(See also this thread: True vs predicted values biased-intercept - #4 by Christian_Simonis)

In general, points to consider are:

- Is the scatter in your label caused by the fact that the labels emerge from different distributions (e.g. the insurance premium might be systematically different for certain characteristics)? In that case, you could consider training separate models for separate groups (e.g. one for car owners who just got their driver's license and one for more experienced drivers).
- Try to incorporate domain knowledge into your features with feature engineering; see also this thread: Time Series Linear Regression
- Depending on your residual analysis, consider models other than a linear model (which will only perform well if you manage to capture all the non-linearity in your features). Gaussian processes might be worth a look, since they also account for uncertainty / confidence and model non-linearity; see also this thread: Deep learning is a small part of ai - #6 by Christian_Simonis
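
To make the first point concrete, here is a small sketch that compares one pooled linear model with one model per segment. Everything in it is assumed for illustration: the binary `group` flag, the synthetic premiums, and the helper names `fit_linear` and `rmse`.

```python
import numpy as np

def fit_linear(x, y):
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def rmse(y, y_hat):
    """Root mean squared error between targets and predictions."""
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

rng = np.random.default_rng(1)
n = 400
age = rng.uniform(18, 64, size=n)
group = rng.integers(0, 2, size=n)   # hypothetical segment flag (0 or 1)
# Synthetic premiums: each segment follows its own line plus noise
premium = np.where(group == 1, 20000 + 300 * age, 2000 + 250 * age)
premium = premium + rng.normal(0, 1000, size=n)

# One pooled model for everyone
a, b = fit_linear(age, premium)
pooled_rmse = rmse(premium, a + b * age)

# One model per segment
pred = np.empty(n)
for g in (0, 1):
    mask = group == g
    a_g, b_g = fit_linear(age[mask], premium[mask])
    pred[mask] = a_g + b_g * age[mask]
split_rmse = rmse(premium, pred)

print(f"pooled RMSE: {pooled_rmse:.0f}, per-segment RMSE: {split_rmse:.0f}")
```

On data like this, the pooled model leaves a large, segment-dependent error that the per-segment models remove. Adding the segment flag (and its interaction with age) as features of a single model is an alternative with a similar effect.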

Hope that helps, @Nimish_Khandelwal!

Best
Christian

Christian_Simonis · August 12, 2023, 1:59pm

In addition, here you can find a nice example of the added value of transforming the targets before fitting the model, including code examples:

Best regards
Christian

TMosh · August 12, 2023, 6:34pm

The outliers in your data set may be the most important information of all. You have no way of knowing. Better to include them as-is than to alter them.

Nimish_Khandelwal · August 14, 2023, 9:28am

Thanks for the input, @Christian_Simonis and @TMosh. I created the linear model and got an average accuracy of around 75% using cross-validation. Could I share the code with you, so that you can look at the procedure and guide me on further steps?

Thanks,
Nimish

Christian_Simonis · August 14, 2023, 10:31am

Hi @Nimish_Khandelwal

Could you please show what your residual plots look like for an example train/test split?

Before sharing the code, two points:

- I am convinced we create the most value, also for (future) fellow learners, if we keep discussing the questions openly, e.g. your question in this open thread, not in private messages.
- If you have the rights to that code and you do not violate the code of conduct, you can share the relevant piece of code if that helps to drive the discussion forward.

Hope that helps!

Best regards
Christian

TMosh · August 14, 2023, 3:02pm

I am confused. Normally a linear model does not give an accuracy metric. If you’re doing classification, then a linear model is not appropriate.

How are you measuring “accuracy”?

Nimish_Khandelwal · August 17, 2023, 7:43am

Hi @TMosh,

Here, accuracy refers to the R-squared value (I used model.score(X_test, y_test)), i.e., 75% of the variance is explained by the model.
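
For reference, a minimal sketch of how R-squared is computed, which is what `score` returns for scikit-learn's `LinearRegression`. The helper name `r_squared` and the small arrays are made up for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

y_test = np.array([3.0, 5.0, 7.0, 9.0])   # made-up targets
y_pred = np.array([2.5, 5.5, 6.5, 9.5])   # made-up predictions
print(r_squared(y_test, y_pred))
```

A model that always predicts the mean of `y_test` scores 0, and a perfect model scores 1, so this is a regression metric rather than a classification accuracy.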

Thanks,
Nimish

Nimish_Khandelwal · August 17, 2023, 7:49am

Hi @Christian_Simonis,

Here are the residual plots for two features: (i) age and (ii) bmi.

Thanks,
Nimish

TMosh · August 17, 2023, 3:57pm

Is there an explanation for these discontinuities?

Christian_Simonis · August 17, 2023, 4:40pm

Thanks @Nimish_Khandelwal.

In addition to Tom’s feedback:

- I would recommend labelling each axis so that each plot can be easily understood.
- Assuming you plotted the residuals on the y axis and the feature on the x axis, your residuals show a clear pattern. This means that, e.g. with feature engineering, you can incorporate that systematic information into your model. The goal should be that the residuals show no systematic dependency on your features afterwards; ideally you just see random noise, something like this: https://global.discourse-cdn.com/dlai/original/3X/c/0/c0b9819ef070619a63cffc64b637c1dd598074e8.jpeg
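
A numeric version of that check, as a sketch under assumptions: the residuals are averaged within bins of a feature, and bin means far from zero indicate a systematic pattern. The helper names, the synthetic `bmi` data, and the quadratic relationship are invented for this example.

```python
import numpy as np

def binned_mean_residuals(feature, residuals, n_bins=5):
    """Mean residual per equal-width bin of the feature (hypothetical helper)."""
    feature = np.asarray(feature, dtype=float)
    residuals = np.asarray(residuals, dtype=float)
    edges = np.linspace(feature.min(), feature.max(), n_bins + 1)
    # Use interior edges only, so every value lands in bins 0 .. n_bins - 1
    idx = np.digitize(feature, edges[1:-1])
    return np.array([residuals[idx == k].mean() for k in range(n_bins)])

def fit_residuals(X, y):
    """Residuals of an ordinary least-squares fit of y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef

rng = np.random.default_rng(0)
n = 1000
bmi = rng.uniform(15, 50, size=n)
# Synthetic target with a quadratic dependence that a straight line cannot capture
y = 100 + 2.0 * (bmi - 32.5) ** 2 + rng.normal(0, 20, size=n)

ones = np.ones(n)
res_linear = fit_residuals(np.column_stack([ones, bmi]), y)
res_quad = fit_residuals(np.column_stack([ones, bmi, bmi ** 2]), y)

print("linear model:      ", binned_mean_residuals(bmi, res_linear).round(1))
print("with bmi^2 feature:", binned_mean_residuals(bmi, res_quad).round(1))
```

With only the linear term, the bin means swing from positive to negative and back; after adding the engineered `bmi ** 2` feature they stay close to zero, which is the "random noise" picture described above.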

See also this thread: True vs predicted values biased-intercept - #4 by Christian_Simonis

Best regards
Christian

Christian_Simonis · August 17, 2023, 4:46pm

Also:

This can only be answered by someone with domain knowledge, but after checking the plots, I would like to emphasise the question about the discontinuities.

By the way: how many features did you use in total, @Nimish_Khandelwal?

Best regards
Christian

Nimish_Khandelwal · August 17, 2023, 6:27pm

I don’t know, but the discontinuities might be due to the outliers in the target variable?

Thanks,
Nimish

Nimish_Khandelwal · August 17, 2023, 6:29pm

For the variable ‘age’, we can see that the residuals are more or less randomly distributed. But for the variable ‘bmi’, there is some kind of pattern.

In total I have used 4 variables.

Thanks,
Nimish

TMosh · August 17, 2023, 7:02pm

What sort of “cleaning” are you doing?

Nimish_Khandelwal · August 19, 2023, 9:39am

Removing outliers from features (using the IQR method), checking the unique values of variables for inconsistent entries, removing variables that do not affect the response, etc.
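
For readers unfamiliar with the IQR method, a minimal sketch of such a filter is shown here. The helper name `iqr_mask`, the conventional factor `k=1.5`, and the example values are assumptions; the actual cleaning code was not shared in this thread.

```python
import numpy as np

def iqr_mask(values, k=1.5):
    """True for values inside [Q1 - k * IQR, Q3 + k * IQR] (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values >= q1 - k * iqr) & (values <= q3 + k * iqr)

bmi = np.array([22.0, 24.5, 27.1, 29.8, 31.2, 33.0, 35.4, 62.0])  # made-up values
keep = iqr_mask(bmi)
print(bmi[keep])   # the value 62.0 falls outside the upper fence and is dropped
```

The same boolean mask has to be applied to every feature column and to the target, so that rows stay aligned. Dropping rows this way also removes their target values, which is worth checking against the residual discontinuities discussed above.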
