I’m currently working on a machine learning project and would love to get suggestions from the community.
The dataset I have covers a large number of installations, with the following features:
- Age of installation
- Incident/accident history
- Audit score (with many missing values)
My goal is to develop a predictive model that estimates a “risk quote” — essentially a percentage score that reflects the likelihood of risk and helps guide preventive actions.
The challenges I’m facing are:
- Handling missing audit scores effectively
- Choosing the right modeling approach (classification vs regression, traditional ML vs deep learning)
- Designing a pipeline that balances interpretability with predictive power
I’d appreciate suggestions on:
- Which tools/frameworks (e.g., scikit-learn, PyTorch, TensorFlow, XGBoost) might be most suitable
- Best practices for data preprocessing and feature engineering in this context
- Recommended model architectures or algorithms for risk prediction problems with mixed data types and missing values
Any insights, references, or shared experiences would be incredibly helpful. Thanks in advance for your guidance!
Congrats on kicking off your project. Please take some time to explain the domain; that will help more people respond. Did you mean to create this post under the Machine learning specialization? Adding @TMosh for confirmation, since these questions align better with ML.
Choice of framework depends on project constraints. If there are no constraints, try a few candidates on the problem at hand and decide empirically. For tabular data, start with classical machine learning algorithms before deep learning approaches; for unstructured data like text, audio, and video, deep learning is the usual choice. See skorch as well.
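For example, a quick baseline comparison on tabular data costs only a few lines. A minimal sketch, using synthetic data as a stand-in for your features and target:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a tabular dataset; swap in your own features/target.
X, y = make_regression(n_samples=1000, n_features=8, noise=10, random_state=0)

# Compare a simple linear baseline against a gradient-boosting model before
# reaching for deep learning.
for model in (Ridge(), HistGradientBoostingRegressor(random_state=0)):
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(type(model).__name__, scores.mean().round(3))
```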
Data preprocessing (including handling missing values) depends on the underlying features; there is no single method that applies to all of them. Feature selection depends on the number of input features you start with, the CPU/RAM you have at your disposal to explore different feature subsets, and any constraints imposed on the final model.
There are no reference architectures for regression / classification problems involving tabular data. That said, try kaggle.com for similar problems and sample solutions.
As far as modelling your problem is concerned, if the model has to predict a continuous value, make it a regression problem.
Interesting problem — risk scoring with sparse audit data is a classic mixed-data challenge, and the approach you choose will depend heavily on how you want to balance interpretability, robustness, and operational constraints.
A few suggestions:
1. Handling missing audit scores
Before imputing, check why they’re missing. In many operational datasets, “missingness” itself is predictive (e.g., sites that skip audits often correlate with higher risk). You can:
- Add a missingness indicator feature
- Use target-encoded or median imputation, but preserve the fact that the original value was missing
- Try models that handle missing values natively (e.g., XGBoost, CatBoost); see the sketch after this list
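A minimal sketch of the first two options, assuming an "audit_score" column like the one you describe (the column name and toy values are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame standing in for your data: audit_score is ~70% missing.
df = pd.DataFrame({"audit_score": [0.09, np.nan, 0.14, np.nan, 0.17]})

# 1) Missingness indicator: the *fact* of a skipped audit may itself be predictive.
df["audit_score_missing"] = df["audit_score"].isna().astype(int)

# 2) Median imputation, keeping the indicator so the model can still "see"
#    which values were originally absent.
imputer = SimpleImputer(strategy="median")
df["audit_score_imputed"] = imputer.fit_transform(df[["audit_score"]]).ravel()

# 3) Alternatively, XGBoost/CatBoost handle NaNs natively, so you can pass
#    audit_score through unimputed and let the trees learn a split direction.
print(df)
```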
2. Modeling approach
If you want a continuous “risk quote,” regression is the natural choice. However, you might also frame it as:
- Binary classification (“high risk vs low risk”) and then map probabilities to a risk score
- Ordinal regression if risk tiers exist
Tree-based models (XGBoost, CatBoost) tend to perform extremely well for tabular risk data, especially with mixed feature types.
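To illustrate the classification framing: the model's predicted probability can serve directly as a 0-1 risk score. A sketch with synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Hypothetical stand-in data: replace with your installation features and a
# binarized target (e.g., risk quote above some threshold = "high risk").
X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
clf.fit(X_train, y_train)

# The predicted probability of the positive class doubles as a 0-1 risk score.
# (For the continuous target itself, XGBRegressor is the regression analogue.)
risk_scores = clf.predict_proba(X_test)[:, 1]
```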
3. Interpretability vs predictive power
Start with something interpretable (Logistic Regression + SHAP), then move to more powerful models (XGBoost / LightGBM). SHAP values work well for explaining complex models and are widely used in regulated industries.
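If it helps, explaining a boosted-tree model with SHAP is only a few lines. A sketch, assuming a fitted XGBoost model and a feature matrix (synthetic here):

```python
import shap
import xgboost as xgb
from sklearn.datasets import make_regression

# Hypothetical data standing in for the installation features.
X, y = make_regression(n_samples=500, n_features=5, random_state=0)
model = xgb.XGBRegressor(n_estimators=100).fit(X, y)

# TreeExplainer computes SHAP values efficiently for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Global view: which features drive the risk score, and in which direction.
shap.summary_plot(shap_values, X)
```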
4. Tools/frameworks
- scikit-learn for baseline models and pipelines
- XGBoost or CatBoost for production-grade performance
- Deep learning (PyTorch, TF) only makes sense if you have very large datasets; tabular DL rarely outperforms boosted trees unless feature interactions are extremely complex
5. Pipeline architecture
- Feature normalization (if using linear models or neural nets)
- Missing-value handling + missingness flags
- Cross-validation with grouped splits (if installations have repeated history)
- SHAP or feature importance for model explainability (see the pipeline sketch below)
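Putting those pieces together, a minimal scikit-learn pipeline might look like this. The column names, the "code_site" grouping key, and the synthetic data are all hypothetical stand-ins for your dataset:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny synthetic stand-in for the real dataset.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(1, 40, n),
    "incident_count": rng.poisson(0.5, n),
    "audit_score": np.where(rng.random(n) < 0.7, np.nan, rng.random(n)),  # ~70% missing
    "code_site": rng.integers(0, 40, n),  # grouping key for CV
    "risk_quote": rng.random(n),          # continuous 0-1 target
})
features = ["age", "incident_count", "audit_score"]

# Impute + flag missingness, then scale (scaling matters for linear models).
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median", add_indicator=True)),
        ("scale", StandardScaler()),
    ]), features),
])

model = Pipeline([("prep", preprocess), ("reg", Ridge())])

# Grouped splits keep all installations from one site in the same fold,
# preventing leakage when sites contribute repeated history.
scores = cross_val_score(model, df[features], df["risk_quote"],
                         cv=GroupKFold(n_splits=5), groups=df["code_site"])
print(scores.mean())
```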
If you share a bit about dataset size or class imbalance, I can help narrow down the exact model choice.
Thanks a lot for the detailed suggestions — they’re super helpful!
To give you more context:
- The full dataset has ~10,000 installations, but I can share a smaller sample of ~100 rows for illustration.
- Features:
• Age of installation (numeric, 1–40 years)
• Incident/accident history (count of past incidents, skewed with many zeros)
• Audit score (numeric, but ~70% missing)
- Target: a continuous “risk quote” between 0 and 1.
- Constraints: interpretability matters since this will be used in safety audits, but we also want strong predictive power.
Here’s a small sample dataset attached for reference. The full dataset is much larger but has the same structure.
| code_installation | code_site   | name_installation | age  | incident/accident | audit score |
|-------------------|-------------|-------------------|------|-------------------|-------------|
| INS-0000184       | SITE-000156 | name1             | 2017 | 4                 | 9%          |
| INS-0000185       | SITE-000148 | name2             | 1959 | 0                 |             |
| INS-0000186       | SITE-000149 | name3             | 1997 | 0                 | 14%         |
| INS-0000187       | SITE-000150 | name4             | 2013 | 0                 |             |
| INS-0000188       | SITE-000151 | name5             | 2015 | 1                 | 17%         |
| INS-0000189       | SITE-000152 | name6             | 2015 | 0                 |             |
| INS-0000190       | SITE-000153 | name7             | 2022 | 0                 | 17%         |
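For reference, a quick way such a sample could be loaded and cleaned, assuming the audit score arrives as a percent string and the "age" column actually holds the installation year, as in the rows above (the reference year is an assumption):

```python
import io
import pandas as pd

# Hypothetical CSV export of the first three sample rows; blank cells mark
# missing audit scores.
raw = """code_installation,code_site,name_installation,age,incident/accident,audit score
INS-0000184,SITE-000156,name1,2017,4,9%
INS-0000185,SITE-000148,name2,1959,0,
INS-0000186,SITE-000149,name3,1997,0,14%
"""

df = pd.read_csv(io.StringIO(raw))

# "audit score" arrives as a percent string; convert to a float in [0, 1],
# leaving missing entries as NaN so tree models can handle them natively.
df["audit_score"] = df["audit score"].str.rstrip("%").astype(float) / 100.0

# The "age" column appears to hold the installation year; derive an actual age.
df["install_age"] = 2024 - df["age"]  # assumes a 2024 reference year

print(df[["code_installation", "install_age", "incident/accident", "audit_score"]])
```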
Would love your thoughts on whether regression vs classification makes more sense here, and if XGBoost/CatBoost would be the right starting point given the missing audit scores.