Building ML model for increasing loan acceptance rate by targeting specific customers

data_boy_su · September 5, 2024, 1:01pm

Hello everyone, this is my first time posting in this community, and I could really use some advice.

I recently took it upon myself to build an ML project. The dataset has 5000 rows of customer information, with features like Income, Demat and Mortgage, and a target, Loan, recording whether each customer accepted or rejected a loan offering.
After a bit of data cleaning, I went through the usual steps and trained 8 different models. After tuning, all of them reached 96%+ accuracy on the test set. Even the kNN classifier got 96%, which looked suspicious to me. The mean CV score was also consistently above 97%.
I am worried I might be doing something wrong here even though I am using stratified k-fold.
Can anyone tell me what I may be doing wrong?
The score difference between Training and Test set ranges from 0.004 to 0.01.

balaji.ambresh · September 6, 2024, 3:38pm

What’s the train / test split ratio?
What’s the distribution of labels?

data_boy_su · September 7, 2024, 2:07pm

80/20, 90/10 and 75/25 tested so far. The question, though, is about the 80/20 split.
11+1 features: 7 continuous, 5 categorical. The target, Loan, is yes/no (converted to 1/0). One of the continuous features is ID, which is redundant.

balaji.ambresh · September 7, 2024, 3:11pm

How many instances of each of the labels do you have in the dataset?

data_boy_su · September 7, 2024, 3:50pm

4500 - 0
500 - 1

balaji.ambresh · September 7, 2024, 4:08pm

What are your thoughts on these?

and

TMosh · September 7, 2024, 4:17pm

If your model always predicts 0, you will have 90% success.
This is quite a skewed data set.
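
To make the point concrete, here is a minimal sketch using the label counts quoted above (4500 zeros, 500 ones). It uses scikit-learn's `DummyClassifier` as an always-predict-0 baseline; the placeholder feature column is an assumption made only so the estimator has something to fit on.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Label counts quoted in the thread: 4500 rejections (0), 500 acceptances (1).
y = np.array([0] * 4500 + [1] * 500)
# The baseline ignores the features, so a placeholder column is enough.
X = np.zeros((len(y), 1))

# "most_frequent" always predicts the majority class, here 0.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

baseline_accuracy = accuracy_score(y, y_pred)
baseline_recall = recall_score(y, y_pred, zero_division=0)
baseline_f1 = f1_score(y, y_pred, zero_division=0)

print(f"accuracy={baseline_accuracy:.2f} "
      f"recall={baseline_recall:.2f} f1={baseline_f1:.2f}")
# prints: accuracy=0.90 recall=0.00 f1=0.00
```

So 90% accuracy is the floor here, not a sign of a good model: the baseline finds none of the customers who accepted a loan.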

Deepti_Prasad · September 7, 2024, 4:21pm

Are you doing stratified random sampling? If so, that is a good approach.

Loan analysis is time-based, so does your data contain any specific feature for that when you encode the labels?

Also, divide the features into dependent and independent variables, which will help you categorize them better when you use them in your model.

Also, with 5000 rows of data, you are working with fewer labels if I consider an 80/20 split.

Regards
DP

data_boy_su · September 7, 2024, 4:25pm

I tried ADASYN and SMOTE, i.e. I tried oversampling the 1s, but I guess the resulting synthetic data must be poor since all my models performed much worse.

data_boy_su · September 7, 2024, 4:27pm

Yes, it is. I am not sure what I could do other than over/undersampling to balance the dataset.
With SMOTE, my models performed a lot worse. I also tried to manually tune weights on my models.
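
One alternative to resampling is class weighting, which creates no synthetic rows. The sketch below is only an illustration under stated assumptions: the data is a synthetic stand-in from `make_classification` (not the loan dataset), and logistic regression is chosen for brevity rather than because it was used in this project. It reports several metrics from stratified k-fold so accuracy can be compared with precision, recall and F1.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan data: 5000 rows, roughly 10% positives.
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)

# class_weight="balanced" weights each class inversely to its frequency,
# so mistakes on the rare class cost more during training.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

metric_names = ["accuracy", "precision", "recall", "f1"]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(model, X, y, cv=cv, scoring=metric_names)

for name in metric_names:
    print(name, round(scores[f"test_{name}"].mean(), 3))
```

If oversampling is used instead, it is generally safer to apply it only to the training folds, since resampling before the split lets synthetic copies of test rows leak into training.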

data_boy_su · September 7, 2024, 4:36pm

yes, I am doing that.

I assume you are talking about time-series features?
In that case, there are none.
Features are
Income, Mortgage, Demat, NetBanking, Family Members, Age, Pin-code, Fixed Deposit, Education, Experience then Loan.

My correlation matrix shows a weak but definite correlation between all the features. I have tried recursive feature elimination and found that, even if the gain is small, keeping all the features gives the best score.

What do you suggest I do in this case?

data_boy_su · September 7, 2024, 4:39pm

Regarding the TensorFlow tutorial, I am a beginner, so I do not think I will be able to carry the method over to the regular ML algorithms I am using.
Since I do not have any experience with NNs and TensorFlow, I am not using it for now; I would like to study it first.

Deepti_Prasad · September 7, 2024, 4:55pm

How many columns are there? Honestly, these features aren’t enough.

Is your yes/no based on the relation between income and expenses?

You could also create a classifier between income and mortgage, where netbanking would be your feature checkpoint for your data analysis.

Which are the categorical columns? Age? Is that all? And what information does the demat column provide?

data_boy_su · September 7, 2024, 5:03pm

Mortgage has some high values but they are few; it is not binary, it is continuous.
Apart from that, everything is as shown.

The only categorical column was Education (sorry for typing it wrong earlier), which I have already encoded into three separate columns.
Binary: Demat, NetBanking, Fixed Deposit.
The rest are continuous.

those are the total columns

Can you explain how this will help me in my problem?

The yes/no → 1/0 loan target is based on those features, with income always being the most important feature by far.

balaji.ambresh · September 7, 2024, 5:06pm

You don’t have to know tensorflow to use a neural network. See MLPClassifier to get started on a neural network.
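
A minimal sketch of what that might look like with scikit-learn alone. The data is a synthetic stand-in from `make_classification`, and the layer sizes and `max_iter` are arbitrary assumptions, not tuned values. Scaling is included because MLPs are sensitive to feature scale.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with roughly 10% positives (not the loan dataset).
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scale inside the pipeline so the scaler is fit on training data only.
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0),
)
mlp.fit(X_train, y_train)
mlp_pred = mlp.predict(X_test)

print("F1 on the positive class:", round(f1_score(y_test, mlp_pred), 3))
```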

One thing to keep in mind is the metric you are optimizing for. For imbalanced class distribution, consider optimizing for a metric like F1-score / precision / recall instead of accuracy.
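
As a rough illustration, the confusion matrix and per-class report show what a single accuracy number hides. This sketch assumes synthetic stand-in data from `make_classification` and a default `GradientBoostingClassifier`; neither comes from the original project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 5000 rows, roughly 10% positives.
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)
# stratify=y keeps the class ratio the same in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)
# Precision, recall and F1 are reported separately for class 0 and class 1.
print(classification_report(y_test, y_pred, digits=3))
```

The row for class 1 is the one to watch: it tells you how many actual loan takers the model finds (recall) and how many of its positive predictions are right (precision).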

Deepti_Prasad · September 7, 2024, 5:17pm

hi @data_boy_su

Most of your columns seem to be null other than income? That is honestly the reason for the worse outcome.

Even mortgage or fd is 0.

Did you check how many rows are left after you remove these null values?

data_boy_su · September 7, 2024, 5:30pm

Yes, most are null. These were originally yes/no, which I converted into 1s and 0s.
A Mortgage of 0 means no mortgage. Mortgage does have nonzero values, but they are few and sit in the middle of the dataset, not in the head I took.
Same with fd and the others: they were all yes/no, which I converted to 0s and 1s.
Mortgage I did not convert; it came just like this.
I did not try removing them, but considering that most of the dataset is like this, I do not think it is worth it.

data_boy_su · September 7, 2024, 5:33pm

Yes, I did use MLPClassifier too, but tuning it did not make it score as high as gradient boosting or even the tree-based models,
so I left it for now.

Yes, I tried that too.

Deepti_Prasad · September 7, 2024, 5:44pm

I am not telling you to remove null values, but check how many rows are left before categorising the columns into 1 or 0. Then you would understand how to approach the split of the data with or without the null values.

data_boy_su · September 8, 2024, 2:49am

Well, like I said, the 0s and 1s were originally yes/no, which I converted into this form.
A row with 0s in all these columns means that the particular customer has no fd, does not use net banking, has no mortgage and no demat account. I have also seen that everyone has some sort of education.
So these rows do depict a certain segment of the customer base.
Still, to see what I would get, I checked how many rows have all 0s.

A total of 1240 rows out of the 5000 are all 0s.
But again, I already knew this; it depicts the characteristics of a certain population.

Even among the all-0 rows:
1125 did not take loan
114 took loan
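
A quick arithmetic check on those counts, as a sketch. Note that 1125 + 114 is 1239, one fewer than the 1240 quoted, so the calculation below uses the two counts as given.

```python
# Counts quoted in the thread.
total_rows = 5000
total_accepted = 500
subgroup_rejected = 1125   # all-0 rows that did not take the loan
subgroup_accepted = 114    # all-0 rows that took the loan
subgroup_rows = subgroup_rejected + subgroup_accepted  # 1239

overall_rate = total_accepted / total_rows
subgroup_rate = subgroup_accepted / subgroup_rows

print(f"overall {overall_rate:.3f}, all-zero subgroup {subgroup_rate:.3f}")
# prints: overall 0.100, all-zero subgroup 0.092
```

The acceptance rate in the all-0 subgroup (about 9.2%) is close to the overall 10%, which would suggest that the binary flags on their own separate the classes only weakly.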
