Hello everyone, this is my first time posting in this community, and I really need some advice.
I recently took it upon myself to build an ML project. The dataset has 5000 rows of customer information with features such as Income, Demat and Mortgage, and the target, Loan, records whether each customer accepted or rejected a loan offer.
After a bit of data cleaning, I went through the usual steps and trained 8 different models. After tuning, all of them reached 96%+ accuracy on the test set; even the kNN classifier got 96%, which looks suspicious to me. The mean CV score was also consistently above 97%.
I am worried I might be doing something wrong here even though I am using stratified k-fold.
Can anyone tell me what I may be doing wrong?
The score difference between the training and test sets ranges from 0.004 to 0.01.
Yes, it is. I am not sure what I could do other than over- or under-sampling to balance the dataset.
With SMOTE, my models performed a lot worse. I also tried manually tuning the weights on my models.
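For reference, here is a minimal sketch of one alternative to resampling: letting scikit-learn reweight the classes through `class_weight="balanced"`. The data is synthetic (`make_classification` with an assumed 90/10 split), not the loan dataset, and logistic regression is only a stand-in for whichever model is being tuned.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 5000 rows, roughly 10% positives (assumed ratio).
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Same model twice: once unweighted, once with weights inversely
# proportional to class frequency.
plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(
    X_train, y_train
)

# Recall on the minority class is usually where the two differ most.
recall_plain = recall_score(y_test, plain.predict(X_test))
recall_weighted = recall_score(y_test, weighted.predict(X_test))
print(f"recall, unweighted: {recall_plain:.3f}")
print(f"recall, balanced:   {recall_weighted:.3f}")
```

Many scikit-learn classifiers accept the same `class_weight` parameter, so the comparison can be repeated per model without touching the training data.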
I assume you are talking about time-series features?
In that case, there are none.
The features are:
Income, Mortgage, Demat, NetBanking, Family Members, Age, Pin-code, Fixed Deposit, Education, Experience, and then Loan (the target).
My correlation matrix shows a small but definite correlation between all the features. I tried recursive feature elimination and found that, even if each contribution is small, keeping all the features gives the best score.
Regarding the TensorFlow tutorial, I am a beginner so I do not think I will be able to convert the method over to the normal ML algorithms I am using.
Since I have no experience with NNs or TensorFlow, I am not using them for now; I would like to study them first.
You don't have to know TensorFlow to use a neural network. See MLPClassifier to get started on a neural network.
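A minimal sketch of what that could look like with scikit-learn's `MLPClassifier`, using synthetic data in place of the loan dataset. The layer size, iteration count and the scaling step are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and target.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Neural networks are sensitive to feature scale, so the scaler is
# fitted inside the pipeline on the training data only.
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0),
)
mlp.fit(X_train, y_train)

test_accuracy = mlp.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

The pipeline exposes the same `fit` / `predict` / `score` interface as the other scikit-learn models, so it can be dropped into an existing cross-validation loop.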
One thing to keep in mind is the metric you are optimizing for. For imbalanced class distribution, consider optimizing for a metric like F1-score / precision / recall instead of accuracy.
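As a rough illustration, the sketch below scores a kNN model with stratified k-fold on several metrics at once and compares its accuracy with a majority-class baseline. The data is synthetic with an assumed 90/10 class split; on data like that, a classifier that always predicts "no loan" already gets close to 90% accuracy, which is why accuracy alone can look better than it is.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in: 5000 rows, roughly 10% positives (assumed ratio).
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.9, 0.1], random_state=0
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["accuracy", "precision", "recall", "f1"]

# Baseline: always predict the majority class.
baseline = cross_validate(
    DummyClassifier(strategy="most_frequent"), X, y, cv=cv, scoring=["accuracy"]
)
baseline_accuracy = baseline["test_accuracy"].mean()

# Same folds, real model, several metrics.
knn = cross_validate(KNeighborsClassifier(), X, y, cv=cv, scoring=scoring)
knn_means = {name: knn[f"test_{name}"].mean() for name in scoring}

print(f"baseline accuracy: {baseline_accuracy:.3f}")
for name, value in knn_means.items():
    print(f"kNN {name}: {value:.3f}")
```

If the model's accuracy is only a few points above the baseline while recall or F1 is much lower, the high accuracy is mostly a reflection of the class imbalance.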
Yes, most are null. These were originally yes/no, which I converted into 1s and 0s.
A Mortgage of 0 means no mortgage. Mortgage does have nonzero values, but they are fewer and appear in the middle of the dataset, not in the head I took.
The same goes for FD and the others: they are all yes/no, which I converted to 0s and 1s.
I did not convert Mortgage; it came just like this.
I did not try removing them, but considering that most of the dataset is like this, I do not think it is worth it.
I am not telling you to remove the null values. Rather, before categorising the columns into 1 or 0, check how many rows are left. Then you will understand how to approach splitting the data with or without the null values.
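One way to run that check with pandas is sketched below. The column names follow the thread, but the rows are made-up toy values and `FixedDeposit` is an assumed spelling of the column name.

```python
import pandas as pd

# Toy rows for illustration only; not taken from the real dataset.
df = pd.DataFrame({
    "Mortgage": [0, 120, 0, 0],
    "Demat": [0, 1, 0, 1],
    "NetBanking": [0, 0, 0, 1],
    "FixedDeposit": [0, 0, 0, 0],
    "Loan": [0, 1, 1, 0],
})
flag_cols = ["Mortgage", "Demat", "NetBanking", "FixedDeposit"]

# True for rows where every listed column is 0.
all_zero = (df[flag_cols] == 0).all(axis=1)
n_all_zero = int(all_zero.sum())

# How the target is distributed inside that group.
loan_counts = df.loc[all_zero, "Loan"].value_counts()

print(f"rows with all zeros: {n_all_zero} of {len(df)}")
print(loan_counts)
```

Comparing `loan_counts` for the all-zero group with the counts for the remaining rows shows whether the group behaves differently with respect to the target.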
Well, like I said, the 0s and 1s were originally yes/no, which I converted into this form.
A row with 0s in all these columns means that particular customer has no FD, does not use net banking, and has no mortgage or demat account; I have also seen that everyone has some level of education.
So these rows do represent a certain segment of the customer base.
But still, to see what I would get, I checked how many rows are all 0s.
A total of 1240 rows out of the 5000 are all 0s.
But again, I already knew this; it reflects a certain population's characteristics.
Even among the all-0 rows:
1125 did not take a loan
114 took a loan
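For a quick sense of scale, the counts above can be turned into a rate. This is plain arithmetic on the numbers as posted. Note that the two counts sum to 1239, one fewer than the 1240 all-zero rows reported, so the denominator here is the sum of the two counts.

```python
# Counts as reported in the post for the all-zero rows.
no_loan = 1125
took_loan = 114

counted = no_loan + took_loan      # rows accounted for by the two counts
loan_rate = took_loan / counted    # share of the group that took a loan

print(counted)                     # 1239
print(round(loan_rate, 3))         # 0.092
```

So roughly 9% of customers in this group still accepted the loan.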