Dependence of recall/precision on data set size

At the end of week 3, the discussion was on using recall/precision for a skewed data set. I understand the reasoning behind this but am wondering about the influence of the data set size (or rather, the data set composition) on these measures. Let’s stick with the lecture example of a rare disease we are trying to diagnose. Clearly, if I am trying to emphasize recall (sensitivity), I should include as many disease cases as possible in my data set. What happens if I have vastly more non-disease samples in my data set (which is natural for a rare disease)? How does the ratio of healthy/diseased samples in my data set influence the training, the error metrics, and their uncertainties?
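To make the question concrete, here is a small sketch (hypothetical numbers, assuming a classifier with fixed sensitivity and specificity) of how the healthy/diseased ratio alone moves precision, while recall is unaffected by it:

```python
# Hypothetical classifier with fixed sensitivity/specificity, applied to
# data sets with different healthy/diseased ratios. Recall depends only
# on sensitivity; precision also depends on the class ratio.

def precision_recall(n_pos, n_neg, sensitivity=0.9, specificity=0.9):
    tp = sensitivity * n_pos         # true positives among diseased
    fn = (1 - sensitivity) * n_pos   # missed diseased cases
    fp = (1 - specificity) * n_neg   # healthy cases flagged as diseased
    recall = tp / (tp + fn)          # = sensitivity, ratio-independent
    precision = tp / (tp + fp)       # shrinks as n_neg grows
    return precision, recall

# Balanced data set: 1,000 diseased vs 1,000 healthy
print(precision_recall(1000, 1000))     # precision 0.9, recall 0.9
# Skewed data set: 1,000 diseased vs 100,000 healthy
print(precision_recall(1000, 100000))   # precision ~0.08, recall still 0.9
```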

Hi @Martin_Fischer
Welcome to the community!

  • If you want to predict a very rare label (1) while most of the data is labeled 0, you can raise the decision threshold so the model only predicts the rare class when it is more confident. Use the F1 score to evaluate your model correctly (it trades off precision against recall), and use the confusion matrix to see which label the model fails to predict correctly.
  • You can also use anomaly detection when a label is very rare; you will learn about it in the next course.
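The threshold/F1/confusion-matrix idea above can be sketched in plain Python (toy scores made up for illustration; the same quantities are available in any ML library):

```python
# Raising the decision threshold on a rare positive label: fewer but more
# confident positive predictions, which shifts the precision/recall trade-off.

def evaluate(y_true, scores, threshold):
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Confusion matrix rows = actual class, columns = predicted class
    return {"confusion": [[tn, fp], [fn, tp]],
            "precision": precision, "recall": recall, "f1": f1}

# Toy data: the positive label (1) is rare
y_true = [0] * 16 + [1] * 4
scores = [0.1] * 12 + [0.6] * 4 + [0.4, 0.7, 0.8, 0.9]  # last 4 are true 1s

print(evaluate(y_true, scores, 0.5))   # default threshold: precision 3/7
print(evaluate(y_true, scores, 0.65))  # higher threshold: precision 1.0
```

The confusion matrix shows exactly where the remaining errors sit (here, one false negative survives the higher threshold), which is what tells you whether to push the threshold further.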

Happy Ramadan :blush:
Best Regards,

We don’t rely on a single metric here. The F1 score combines precision and recall into one number precisely to take this skew into consideration.
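Concretely, F1 is the harmonic mean of precision and recall, which (unlike a plain average) stays low unless both components are reasonably high:

```python
# F1 = harmonic mean of precision and recall. A lopsided pair is punished
# much harder than by the arithmetic mean.

def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(f1(0.9, 0.9))   # both high -> 0.9
print(f1(0.99, 0.1))  # lopsided  -> ~0.18, though the plain mean is ~0.55
```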