I’ve been working on improving my submission for this month’s Loan Processing competition, but I’m stuck. So far, I’ve tried calculating Mutual Information Scores and correlations to evaluate feature dependencies and generate new features. After data cleaning, I applied several models, including a Neural Network, Decision Tree, Random Forest, and K-Neighbors. Despite my efforts, my ROC AUC score remains between 0.75 and 0.85, depending on the test split.

I’d really appreciate any pointers on areas I may have overlooked or alternative approaches to enhance model performance. If anyone has experience with similar problems or could suggest advanced feature engineering techniques or tuning tips, I’d be grateful!

Thank you so much for your help!
Here is a link to my notebooks. They are a little messy and out of order; apologies for that.

I set the model up as a regression problem using all the data points. It didn’t occur to me that we could treat it as classification straight away. My apologies: I have only been working with images and image classification for so long that I didn’t know how to approach the problem.
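For anyone following along, here is a minimal sketch of the classification framing, scored with ROC AUC over stratified folds so the result depends less on one particular test split. It assumes scikit-learn is installed; the synthetic data from `make_classification`, the class balance, the random forest settings, and the helper name `cv_auc` are all stand-ins of mine, not the competition data or anyone's actual notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the loan data (assumption: binary target, ~20% positives).
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=5,
    weights=[0.8, 0.2],
    random_state=0,
)


def cv_auc(X, y, n_splits=5, seed=0):
    """Return one ROC AUC score per stratified fold (hypothetical helper)."""
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    # Stratified folds keep the approved/not-approved ratio the same in every split.
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # scoring="roc_auc" ranks the classifier's predicted scores, not hard labels.
    return cross_val_score(model, X, y, cv=folds, scoring="roc_auc")


scores = cv_auc(X, y)
print(f"ROC AUC per fold: {scores.round(3)}")
print(f"mean {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the mean and spread across folds gives a steadier number than a single train/test split, which may be part of why the score moved around so much.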

The competition’s over at this point, but if you want to check out my XGBoost model, I ended up with ~0.955 ROC AUC for my public score. You can check out my annotated model/setup on my

If I understand correctly, this dataset provides historical data on whether past loans were approved. Useful enough, I suppose, for learning to build models from tabular data. Not a very interesting problem from a business perspective, though. What a business would want to predict is not whether a loan will be approved, but whether it should be approved.

What models built on this data predict is the outcome of a business process. What is more valuable is predicting the business risk or expected financial return. What, if anything, in your model would need to change to do that?

It’s a good question. I think, given the same dataset, instead of predicting loan approval for each borrower, I would seek to predict either the financial risk or expected return from taking on their loan.

This would mean engineering new target features, e.g.,

Probability of Default (PD): Estimate the likelihood that a borrower will default on the loan.

Loss Given Default (LGD): Calculate the expected loss if a default occurs.

Exposure at Default (EAD): Determine the total value at risk at the time of default.

Expected Return: Combine PD, LGD, and EAD to estimate the expected financial return for each loan.
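As a rough sketch of how these quantities could combine: the standard credit-risk identity is expected loss = PD × LGD × EAD. Turning that into an expected return needs some notion of income from loans that are repaid; the `interest_income` field and the return formula below are simplifying assumptions of mine, and the numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Loan:
    pd: float               # probability of default, in [0, 1]
    lgd: float              # loss given default, as a fraction of exposure, in [0, 1]
    ead: float              # exposure at default, in currency units
    interest_income: float  # hypothetical: income earned if the loan is repaid


def expected_loss(loan: Loan) -> float:
    """Standard expected-loss identity: PD * LGD * EAD."""
    return loan.pd * loan.lgd * loan.ead


def expected_return(loan: Loan) -> float:
    """Assumed simplification: income when repaid, minus expected loss."""
    return (1.0 - loan.pd) * loan.interest_income - expected_loss(loan)


# Made-up example loan, not taken from the competition data.
loan = Loan(pd=0.05, lgd=0.4, ead=10_000.0, interest_income=1_200.0)
print(f"expected loss:   {expected_loss(loan):.2f}")
print(f"expected return: {expected_return(loan):.2f}")
```

In practice each of PD, LGD, and EAD would come from its own model, and the approval decision would compare the expected return against some threshold rather than reproduce past approvals.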