I’ve been working on improving my submission for this month’s Loan Processing competition, but I’m stuck. So far, I’ve tried calculating Mutual Information Scores and correlations to evaluate feature dependencies and generate new features. After data cleaning, I applied several models, including a Neural Network, Decision Tree, Random Forest, and K-Neighbors. Despite my efforts, my ROC AUC score remains between 0.75 and 0.85, depending on the test split.
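For anyone following along, the feature-screening and evaluation steps described above can be sketched roughly like this (the column names and synthetic data are placeholders, not the actual competition dataset; cross-validated AUC is one way to reduce the split-to-split variance mentioned):

```python
# Sketch of mutual-information screening plus cross-validated ROC AUC.
# All feature names and data here are made up for illustration.
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "income": rng.normal(50_000, 15_000, 500),
    "loan_amount": rng.normal(10_000, 4_000, 500),
    "credit_lines": rng.integers(1, 15, 500),
})
# Toy target loosely tied to the income-to-loan ratio (illustration only).
y = (X["income"] / X["loan_amount"] + rng.normal(0, 1, 500) > 5).astype(int)

# Mutual information ranks feature/target dependence, including
# nonlinear relationships that plain correlation can miss.
mi = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)
print(mi.sort_values(ascending=False))

# Cross-validated ROC AUC is less sensitive to any single test split
# than a one-off hold-out score.
auc = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                      cv=5, scoring="roc_auc")
print(auc.mean())
```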
I’d really appreciate any pointers on areas I may have overlooked or alternative approaches to enhance model performance. If anyone has experience with similar problems or could suggest advanced feature engineering techniques or tuning tips, I’d be grateful!
Thank you so much for your help!
Here is a link to my notebooks; they are a little disorganized, apologies for that.
I approached the model as a regression problem considering all the data points. It didn’t occur to me that we could carry out classification straight away. My apologies; I have only been working with images and image classification for so long that I didn’t know how to approach this problem.
The competition’s over at this point, but if you want to check out my XGBoost model, I ended up with ~0.955 ROC AUC for my public score. You can find my annotated model/setup on my
If I understand correctly this dataset provides historical data on whether past loans were either approved or not. Useful enough I suppose for learning to build models from tabular data. Not a very interesting problem from a business perspective, though. What a business would want to predict is not whether a loan will be approved or not, but whether one should be approved.
What models built on this data predict is the outcome of a business process. What is more valuable is predicting the business risk or expected financial return. What, if anything, in your model would need to change to do that?
It’s a good question. I think, given the same dataset, instead of predicting loan approval for each borrower, I would seek to predict either the financial risk or expected return from taking on their loan.
This would mean engineering new target variables, e.g.:
Probability of Default (PD): Estimate the likelihood that a borrower will default on the loan.
Loss Given Default (LGD): Calculate the expected loss if a default occurs.
Exposure at Default (EAD): Determine the total value at risk at the time of default.
Expected Return: Combine PD, LGD, and EAD to estimate the expected financial return for each loan.
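The pieces above combine in the standard expected-loss formula, EL = PD × LGD × EAD, with expected return as interest income minus expected loss. A minimal sketch (all numbers made up purely for illustration):

```python
# Hypothetical illustration of combining PD, LGD, and EAD into
# expected loss and a simple one-period expected return.
loans = [
    # (probability_of_default, loss_given_default, exposure_at_default, interest_income)
    (0.02, 0.40, 10_000, 900),
    (0.10, 0.60, 25_000, 3_200),
]

for pd_, lgd, ead, income in loans:
    expected_loss = pd_ * lgd * ead           # EL = PD * LGD * EAD
    expected_return = income - expected_loss  # ignores funding costs, timing, etc.
    print(f"EL={expected_loss:.0f}  expected return={expected_return:.0f}")
```

A real credit model would estimate PD, LGD, and EAD separately (PD as a classification target, LGD and EAD as regressions), then combine them, rather than predicting a single approval label.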
The exercise as originally defined seems like a shiny toy for IT to tinker with. Measures like the ones you identified, however, can be tied directly to profitability and risk mitigation, which in turn justifies investment in AI/ML by a business. People frequently ask the community for ideas about ML projects; I’d suggest always looking for opportunities to directly impact metrics the C-suite cares about.
@ai_curious Unfortunately I’d have to agree with you. These are not always the most interesting subset of problems, or, worse yet, for those of us who had to go through the subprime mortgage crisis (I’d just gotten out of grad school, and then the world collapsed), not the most ‘moral’, but perhaps the most ‘profitable’.
<rant>
I would argue that applying technology to deciding whether a loan should be made is a much more interesting problem to study than whether a loan will be made. And the idea of looking for ML applications that directly impact an organization’s attainment of its objectives is relevant in non-financial domains as well; predicting hospital-acquired infections or readmissions, crop yield, train derailments, etc. are, in my opinion, all moral and profitable in the broadest possible sense of that word.
Organizations apply means to achieve ends, so AI strategy and tactics should always be in the service of achieving those ends. I suggest that building a model that predicts loan approval is not a good exemplar. Merely changing the name of the labels column from approved to profitable or defaulted or some other real business-metric-related outcome would create a better learning experience.
</rant>
If the system is built for a business, it must be driven by business objectives…
and
A pattern I see in many short-lived projects is that data scientists become focused on hacking ML metrics without paying attention to business metrics. Their managers, however, only care about business metrics and, after failing to see how an ML project can help push their business metrics, kill the projects prematurely (and possibly let go of the data science team involved).
If my data science/ML team came to me and showed off their new system to predict whether the company would approve a loan, I don’t think letting them go would be at all premature. All cost, no benefit? Don’t need an ML system to help make that decision.
If the ML task done by the team can be replaced with a 3rd party API with better results at a lower price.
If the ML team has low maturity in the problem domain and is struggling to meet deadlines with insights that fail to deliver expected business value.
From an explanation standpoint, if a loan application got rejected and the ML team is unable to tell the reason for rejection, that’s going to affect the business. Some teams prefer simpler models / rules for this reason.