Introducing DataSanity: A Free Tool for Data Quality Checks + GitHub Repo!
Hey DL community! 
I built DataSanity, a lightweight, intuitive data quality and sanity-checking tool designed to help ML practitioners and data scientists catch data issues early in the pipeline, before model training.
Key Features
Upload your dataset and explore its structure
Automatic detection of missing values & anomalies
Visual summaries of distributions & outliers
Quick insights with no complex setup needed
Try it LIVE:
https://datasanity-bg3gimhju65r9q7hhhdsm3.streamlit.app/
Explore the code on GitHub:
GitHub - JulijanaMilosavljevic/Datasanity: DataSanity is a dataset health and ML strategy assistant for tabular machine learning.
Built with Streamlit and easy to extend. Contributions, issues, and suggestions are welcome!
Would love your thoughts:
What features are most helpful for you?
What data quality challenges do you face regularly?
Let's improve data sanity together!
A fellow data enthusiast
Congrats on your project.
Here's some feedback after trying the Streamlit app with iris.csv, using species as the target column:
- There is no explanation of sections like Dataset health score and Risk level.
- The Modeling advice section provides generic text that is not specific to the problem. The dataset is small and low-dimensional, and the feedback to the user should reflect that.
- The XGBoost regressor shouldn't show up in classification starter code.
Have you looked at solutions like Google Looker and the Kaggle dataset viewer?
Thank you so much for taking the time to test it and provide detailed feedback. I truly appreciate it.
You're absolutely right:
- I need to clearly explain how Dataset Health Score and Risk Level are computed. I'll add a methodology section describing the scoring logic and thresholds.
- The modeling advice should adapt to dataset characteristics (size, dimensionality, task type). I'll improve the logic so feedback is context-aware rather than generic.
- Good catch on the XGBoost regressor appearing in classification starter code. That's a bug on my side and will be fixed.
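As a rough sketch of what task-aware starter code could look like: infer the task from the target column first, then pick the estimator family from that. The function names `infer_task` and `starter_model` and the cutoff of 20 distinct values are assumptions made for illustration; they are not taken from the DataSanity code base.

```python
from typing import Sequence


def infer_task(target: Sequence, max_classes: int = 20) -> str:
    """Guess the task type from the target values (hypothetical heuristic)."""
    values = [v for v in target if v is not None]
    distinct = set(values)
    # Non-numeric targets (strings, booleans) are treated as class labels.
    all_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    if not all_numeric:
        return "classification"
    # Integer-valued targets with few distinct values look like class ids.
    all_integral = all(float(v).is_integer() for v in values)
    if all_integral and len(distinct) <= max_classes:
        return "classification"
    return "regression"


def starter_model(task: str) -> str:
    """Map the task to an estimator class name for the generated starter code."""
    models = {
        "classification": "XGBClassifier",
        "regression": "XGBRegressor",
    }
    return models[task]


species = ["setosa", "versicolor", "virginica", "setosa"]
print(starter_model(infer_task(species)))

prices = [101.5, 99.25, 250.0, 175.75]
print(starter_model(infer_task(prices)))
```

A guessed task type can be wrong (for example, an integer rating used as a regression target), so letting the user override it is probably safer than relying on the heuristic alone.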
Regarding tools like Google Looker and the Kaggle dataset viewer: yes, I've looked at them. My goal is to build something lightweight and ML-focused, specifically targeting pre-model sanity checks rather than BI-style dashboards.
Thanks again. This kind of feedback really helps improve the project.
Data distributions can be displayed as histograms along a feature, hence my question about those tools. Also, a guessed data type can be incorrect.
A few more points:
- Can't you run a test like ANOVA to measure the association between a categorical target variable and a numeric predictor?
- Time series problems can be approached in many ways. I think it would be nicer to ask the user whether it's a time series problem and provide feedback accordingly.
- The threshold for the ID column seems odd (why 0.98 and not 1)?
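To make the threshold question concrete, here is a minimal sketch of a uniqueness-ratio check. It assumes the ID detection compares the share of distinct values against a cutoff; the helper names `unique_ratio` and `looks_like_id` are hypothetical, and the actual DataSanity logic may differ.

```python
from typing import Sequence


def unique_ratio(column: Sequence) -> float:
    """Share of distinct values among the non-missing entries."""
    values = [v for v in column if v is not None]
    if not values:
        return 0.0
    return len(set(values)) / len(values)


def looks_like_id(column: Sequence, threshold: float = 0.98) -> bool:
    """Flag a column as ID-like when nearly all of its values are distinct."""
    return unique_ratio(column) >= threshold


# 100 rows where one ID was accidentally duplicated: 99 distinct values.
ids_with_duplicate = list(range(99)) + [0]
print(unique_ratio(ids_with_duplicate))                   # 0.99
print(looks_like_id(ids_with_duplicate, threshold=0.98))  # True
print(looks_like_id(ids_with_duplicate, threshold=1.0))   # False
```

With a cutoff of exactly 1, a single duplicated row is enough to hide an ID column, which is one possible reason for a slightly lower value. A continuous float feature also has a high uniqueness ratio, so a check like this likely needs to consider the column's data type as well.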
Thank you again for the detailed feedback and suggestions.
These are very helpful and I appreciate you taking the time to explore the tool.
I agree that features like ANOVA-based feature analysis and better handling of time-series datasets would make the tool more useful. I'll keep these ideas in mind for future improvements.
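For reference, a minimal sketch of the one-way ANOVA F statistic for a numeric feature grouped by a categorical target, written with the standard library only. The function name `anova_f_statistic` is made up for this example and nothing here is part of the tool yet.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def anova_f_statistic(values: Sequence[float], labels: Sequence[str]) -> float:
    """One-way ANOVA F statistic for a numeric feature grouped by class label."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)

    n_total = len(values)
    n_groups = len(groups)
    grand_mean = sum(values) / n_total

    # Between-group sum of squares: distance of each class mean from the grand mean.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    # Within-group sum of squares: spread of values around their own class mean.
    ss_within = sum(
        sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups.values()
    )

    ms_between = ss_between / (n_groups - 1)
    ms_within = ss_within / (n_total - n_groups)
    return ms_between / ms_within


feature = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
target = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]
print(anova_f_statistic(feature, target))  # 27.0
```

A large F value means the class means differ a lot relative to the spread within each class. The sketch does not handle the degenerate cases of a single class or zero within-class variance, and it returns no p-value; `scipy.stats.f_oneway` provides both the statistic and the p-value if SciPy is available.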
At the moment I'm focusing on other work, but I plan to revisit and refine the project later.
Thanks again for your insights!