Dependence of recall/precision on data set size

At the end of week 3, the discussion was on using recall/precision for a skewed data set. I understand the reasoning behind this but am wondering about the influence of the data set size (or rather, the data set composition) on these measures. Let's stick with the lecture example of a rare disease we are trying to diagnose. Clearly, if I am trying to emphasize recall (sensitivity), I should include as many cases of the disease as possible in my data set. What happens if I have vastly more non-disease samples in my data set (which is natural for a rare disease)? How does the ratio of healthy/diseased samples in my data set influence the training, the error metrics, and their uncertainties?
Thanks,
Martin

Hi @Martin_Fischer
Welcome to the community!

  • If you want to predict a very rare label (1) and most of the data is labeled 0, you can increase the decision threshold so that positive predictions are made with higher confidence. Use the F1 score to evaluate the model correctly (it trades off precision against recall), and use the confusion matrix to see which label, or which kind of prediction, the model fails on (see the sketch after this list).
  • You can also use anomaly detection when one label is very rare; you will learn about it in the next course.
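
Here is a minimal sketch of the threshold/F1/confusion-matrix idea, assuming scikit-learn is available; the synthetic data from `make_classification` and the 5% positive rate are just placeholders for a rare-disease setting, not anything from the course:

```python
# Sketch: raise the decision threshold on an imbalanced binary problem
# and compare precision, recall, F1, and the confusion matrix.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic "rare disease" data: ~5% positive (label 1), ~95% negative (label 0).
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # estimated P(y = 1)

for threshold in (0.5, 0.7, 0.9):
    y_pred = (proba >= threshold).astype(int)
    print(f"threshold = {threshold}")
    print("  precision:", precision_score(y_test, y_pred, zero_division=0))
    print("  recall:   ", recall_score(y_test, y_pred))
    print("  F1:       ", f1_score(y_test, y_pred))
    # Rows = true class, columns = predicted class: shows which label
    # the model fails to predict correctly.
    print("  confusion matrix:\n", confusion_matrix(y_test, y_pred))
```

As the threshold goes up, precision typically rises while recall falls, which is exactly the trade-off the F1 score summarizes.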

Happy Ramadan :blush:
Best Regards,
Abdelrahman

We don’t rely on a single metric here. The F1 score combines precision and recall into one number precisely to take this kind of skew into consideration.
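
For a quick illustration, F1 is the harmonic mean of precision and recall, so it is only high when both are high; the numbers below are made-up examples:

```python
# F1 as the harmonic mean of precision (P) and recall (R).
def f1(precision: float, recall: float) -> float:
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1(0.9, 0.1))  # ~0.18 -- one poor metric drags F1 down
print(f1(0.6, 0.6))  # 0.6  -- balanced precision and recall score better
```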