DataSanity

:fire: Introducing DataSanity: A Free Tool for Data Quality Checks + GitHub Repo! :magnifying_glass_tilted_left:

Hey DL community! :waving_hand:

I built DataSanity, a lightweight, intuitive data quality and sanity-checking tool designed to help ML practitioners and data scientists catch data issues early in the pipeline, before model training.

:backhand_index_pointing_right: Key Features

:check_mark: Upload your dataset and explore its structure

:check_mark: Automatic detection of missing values & anomalies

:check_mark: Visual summaries of distributions & outliers

:check_mark: Quick insights, no complex setup needed
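DataSanity's internals aren't shown in this post, but the kinds of checks listed above can be sketched in a few lines of pandas. A minimal illustration (function and column names are my own, not the tool's actual API):

```python
import pandas as pd

def sanity_report(df: pd.DataFrame) -> dict:
    """Summarize basic data-quality signals for a tabular dataset."""
    report = {
        "n_rows": len(df),
        "n_cols": df.shape[1],
        "missing_pct": df.isna().mean().round(3).to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
    }
    # Flag numeric outliers with the 1.5 * IQR rule, column by column
    outliers = {}
    for col in df.select_dtypes(include="number").columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
        outliers[col] = int(mask.sum())
    report["outlier_counts"] = outliers
    return report

# Toy example: one missing age, one missing city, one extreme age value
df = pd.DataFrame({"age": [25, 30, None, 29, 120],
                   "city": ["NY", "NY", "LA", None, "LA"]})
print(sanity_report(df))
```

A real tool would layer visual summaries on top of a report like this; the point is that the core signals (missingness, duplicates, outliers) are cheap to compute up front.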

:pushpin: Try it LIVE:

:backhand_index_pointing_right: https://datasanity-bg3gimhju65r9q7hhhdsm3.streamlit.app/

:laptop: Explore the code on GitHub:

:backhand_index_pointing_right: GitHub - JulijanaMilosavljevic/Datasanity: DataSanity is a dataset health and ML strategy assistant for tabular machine learning.

:hammer_and_wrench: Built with Streamlit and easy to extend; contributions, issues, and suggestions are welcome!

Would love your thoughts:

:star: What features are most helpful for you?

:star: What data quality challenges do you face regularly?

Let's improve data sanity together! :blush:

- A fellow data enthusiast


Congrats on your project.
Here's some feedback after trying the Streamlit app with iris.csv, using species as the target column:

  1. There is no explanation of sections like Dataset Health Score and Risk Level.
  2. The Modeling Advice section provides generic text that is not specific to the problem. Iris is a small, low-dimensional dataset, and the feedback should reflect that.
  3. An XGBoost regressor shouldn't show up in classification starter code.

Have you looked at solutions like Google Looker and the Kaggle dataset viewer?


Thank you so much for taking the time to test it and provide detailed feedback; I truly appreciate it.

You're absolutely right:

  1. I need to clearly explain how the Dataset Health Score and Risk Level are computed. I'll add a methodology section describing the scoring logic and thresholds.

  2. The modeling advice should adapt to dataset characteristics (size, dimensionality, task type). I'll improve the logic so the feedback is context-aware rather than generic.

  3. Good catch on the XGBoost regressor appearing in classification starter code; that's a bug on my side and will be fixed.
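One simple way to make the starter code task-aware is to branch on the target column's dtype and cardinality before picking a model. A minimal sketch (the `max_classes` heuristic and all names here are my own illustration, not DataSanity's actual logic):

```python
import pandas as pd

def infer_task(target: pd.Series, max_classes: int = 20) -> str:
    """Guess whether a target column implies classification or regression."""
    if target.dtype == object or str(target.dtype) == "category":
        return "classification"
    # A numeric target with few distinct values is likely a set of class labels
    if target.nunique() <= max_classes:
        return "classification"
    return "regression"

# Map the inferred task to the matching starter-code import
STARTER_IMPORT = {
    "classification": "from xgboost import XGBClassifier",
    "regression": "from xgboost import XGBRegressor",
}

species = pd.Series(["setosa", "versicolor", "virginica"])
print(infer_task(species))                  # string target -> classification
print(STARTER_IMPORT[infer_task(species)])
```

The cardinality cutoff is a judgment call (integer targets with many levels are ambiguous), which is exactly why surfacing the inferred task to the user for confirmation is worthwhile.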

Regarding tools like Google Looker and the Kaggle dataset viewer: yes, I've looked at them. My goal is to build something lightweight and ML-focused, specifically targeting pre-model sanity checks rather than BI-style dashboards.

Thanks again; this kind of feedback really helps improve the project.


Data distributions can be displayed as histograms along each feature, hence the question. Also, a guessed data type can be incorrect.

A few more points:

  1. Couldn't you perform a test like ANOVA to measure the association between a categorical target variable and a numeric predictor?
  2. Time series problems can be approached in many ways. It would be nicer to ask the user whether it's a time series problem and tailor the feedback accordingly.
  3. The threshold for detecting an ID column seems odd (why 0.98 and not 1)?
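On the ANOVA point: scipy's one-way F-test works directly for a numeric predictor grouped by a categorical target. A quick sketch on iris-like toy data (column names and values are illustrative):

```python
import pandas as pd
from scipy.stats import f_oneway

df = pd.DataFrame({
    "petal_length": [1.4, 1.3, 1.5, 4.5, 4.7, 4.4, 5.9, 6.1, 5.8],
    "species": ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3,
})

# One-way ANOVA: does mean petal_length differ across species?
groups = [g["petal_length"].values for _, g in df.groupby("species")]
f_stat, p_value = f_oneway(*groups)
print(f"F={f_stat:.1f}, p={p_value:.4f}")  # a small p suggests a strong association
```

A per-feature F-statistic like this could rank numeric predictors by how strongly they separate the target classes, which is arguably more informative than a generic correlation matrix for classification problems.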

Thank you again for the detailed feedback and suggestions.

These are very helpful and I appreciate you taking the time to explore the tool.

I agree that features like ANOVA-based feature analysis and better handling of time series datasets would make the tool more useful. I'll keep these ideas in mind for future improvements.

At the moment I'm focusing on other work, but I plan to revisit and refine the project later.

Thanks again for your insights!