I’ve been working on improving my submission for this month’s Loan Processing competition, but I’m stuck. So far, I’ve tried calculating Mutual Information Scores and correlations to evaluate feature dependencies and generate new features. After data cleaning, I applied several models, including a Neural Network, Decision Tree, Random Forest, and K-Neighbors. Despite my efforts, my ROC AUC score remains between 0.75 and 0.85, depending on the test split.
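For anyone following along, the feature-screening and evaluation steps described above can be sketched roughly like this (the column names and synthetic data are placeholders, not the actual competition dataset; cross-validated AUC is one way to reduce the split-to-split variance mentioned):

```python
# Sketch of mutual-information screening plus cross-validated ROC AUC.
# All feature names and data here are made up for illustration.
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "income": rng.normal(50_000, 15_000, 500),
    "loan_amount": rng.normal(10_000, 4_000, 500),
    "credit_lines": rng.integers(1, 15, 500),
})
# Toy target loosely tied to the income-to-loan ratio (illustration only).
y = (X["income"] / X["loan_amount"] + rng.normal(0, 1, 500) > 5).astype(int)

# Mutual information ranks feature/target dependence, including
# nonlinear relationships that plain correlation can miss.
mi = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)
print(mi.sort_values(ascending=False))

# Cross-validated ROC AUC is less sensitive to any single test split
# than a one-off hold-out score.
auc = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                      cv=5, scoring="roc_auc")
print(auc.mean())
```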
I’d really appreciate any pointers on areas I may have overlooked or alternative approaches to enhance model performance. If anyone has experience with similar problems or could suggest advanced feature engineering techniques or tuning tips, I’d be grateful!
Thank you so much for your help!
Here is a link to my notebooks; they are a little disorganized, apologies for that.
I approached the model as a regression problem considering all the data points. It didn’t occur to me that we could carry out classification straight away. My apologies; I have only been working with images and image classification for so long that I didn’t know how to approach this problem.
The competition’s over at this point, but if you want to check out my XGBoost model, I ended up with ~0.955 ROC AUC for my public score. You can find my annotated model/setup on my
If I understand correctly this dataset provides historical data on whether past loans were either approved or not. Useful enough I suppose for learning to build models from tabular data. Not a very interesting problem from a business perspective, though. What a business would want to predict is not whether a loan will be approved or not, but whether one should be approved.
What models built on this data predict is the outcome of a business process. What is more valuable is predicting the business risk or expected financial return. What, if anything, in your model would need to change to do that?
It’s a good question. I think, given the same dataset, instead of predicting loan approval for each borrower, I would seek to predict either the financial risk or expected return from taking on their loan.
This would mean engineering new target variables, e.g.:
Probability of Default (PD): Estimate the likelihood that a borrower will default on the loan.
Loss Given Default (LGD): Calculate the expected loss if a default occurs.
Exposure at Default (EAD): Determine the total value at risk at the time of default.
Expected Return: Combine PD, LGD, and EAD to estimate the expected financial return for each loan.
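The pieces above combine in the standard expected-loss formula, EL = PD × LGD × EAD, with expected return as interest income minus expected loss. A minimal sketch (all numbers made up purely for illustration):

```python
# Hypothetical illustration of combining PD, LGD, and EAD into
# expected loss and a simple one-period expected return.
loans = [
    # (probability_of_default, loss_given_default, exposure_at_default, interest_income)
    (0.02, 0.40, 10_000, 900),
    (0.10, 0.60, 25_000, 3_200),
]

for pd_, lgd, ead, income in loans:
    expected_loss = pd_ * lgd * ead           # EL = PD * LGD * EAD
    expected_return = income - expected_loss  # ignores funding costs, timing, etc.
    print(f"EL={expected_loss:.0f}  expected return={expected_return:.0f}")
```

A real credit model would estimate PD, LGD, and EAD separately (PD as a classification target, LGD and EAD as regressions), then combine them, rather than predicting a single approval label.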
The exercise as originally defined seems like a shiny toy for IT to tinker with. Measures like the ones you identified, however, can be tied directly to profitability and risk mitigation, which in turn justifies investment in AI/ML by a business. People frequently ask the community for ideas about ML projects; I’d suggest always looking for opportunities to directly impact metrics the C-suite cares about.
@ai_curious Unfortunately I’d have to agree with you. These are not always the most interesting subset of problems, or, worse yet, for those of us who had to go through the subprime mortgage crisis (I’d just gotten out of grad school, and then the world collapsed), not the most ‘moral’, but perhaps the most ‘profitable’.
<rant>
I would argue that applying technology to deciding whether a loan should be made is a much more interesting problem to study than whether a loan will be made. And the idea of looking for ML applications that directly impact an organization’s attainment of its objectives is relevant in non-financial domains as well; predicting hospital-acquired infections or readmissions, crop yield, train derailments, etc. are, in my opinion, all moral and profitable in the broadest possible sense of that word.
Organizations apply means to achieve ends, so AI strategy and tactics should always be in the service of achieving those ends. I suggest that building a model that predicts loan approval is not a good exemplar. Merely changing the name of the labels column from approved to profitable or defaulted or some other real business-metric-related outcome would create a better learning experience.
</rant>
If the system is built for a business, it must be driven by business objectives…
and
A pattern I see in many short-lived projects is that data scientists become focused on hacking ML metrics without paying attention to business metrics. Their managers, however, only care about business metrics and, after failing to see how an ML project can help push their business metrics, kill the projects prematurely (and possibly let go of the data science team involved).
If my data science/ML team came to me and showed off their new system to predict whether the company would approve a loan, I don’t think letting them go would be at all premature. All cost, no benefit? Don’t need an ML system to help make that decision.
If the ML task done by the team can be replaced with a 3rd party API with better results at a lower price.
If the ML team has low maturity in the problem domain and is struggling to meet deadlines with insights that fail to deliver expected business value.
From an explanation standpoint, if a loan application got rejected and the ML team is unable to tell the reason for rejection, that’s going to affect the business. Some teams prefer simpler models / rules for this reason.