Precision and recall

Hello, after watching videos about precision-recall, I have formulated them like this:
Precision - P(y=1 | ŷ=1): what is the probability that y is 1 given that ŷ=1 (the model predicted 1)?
Recall - P(ŷ=1 | y=1): what is the probability that the model predicts 1 when y=1?
Is this the correct way to think about precision and recall ?
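
For example, I would estimate them from data like this (a minimal NumPy sketch with made-up labels, not from the lectures):

```python
import numpy as np

y     = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # actual labels
y_hat = np.array([1, 0, 0, 1, 1, 0, 1, 0])  # model predictions

# Precision as P(y=1 | y_hat=1): among the cases the model flagged as 1,
# what fraction is actually 1?
precision = (y[y_hat == 1] == 1).mean()

# Recall as P(y_hat=1 | y=1): among the cases that are actually 1,
# what fraction did the model flag as 1?
recall = (y_hat[y == 1] == 1).mean()

print(precision, recall)  # 0.75 0.75 for these made-up labels
```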

Hi @saba_odisharia

You are right, but for more intuition, here is how precision and recall are read from the confusion matrix (the 2×2 table of true/false positives and negatives).


In an imbalanced classification problem with two classes, precision is calculated as the number of true positives (TP) divided by the total number of true positives (TP) and false positives (FP). In other words, it is the number of examples where the predicted value ŷ = 1 and the real value y = 1, divided by the number of predictions where ŷ = 1.

  • Precision = \frac{TruePositives}{TruePositives + FalsePositives}

Recall is a metric that quantifies the number of correct positive predictions out of all the positive predictions that could have been made. In other words, it is the number of examples where the predicted value ŷ = 1 and the real value y = 1, divided by the number of examples where the real value y = 1.

  • Recall = \frac{TruePositives}{TruePositives + FalseNegatives}
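
As a quick illustration (my own sketch, not course code), here are the same two formulas in Python, starting from the confusion-matrix counts:

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Made-up counts, just to show the computation
print(precision_recall(tp=8, fp=2, fn=4))  # (0.8, 0.666...)
```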

Cheers,
Abdelrahman


I don’t understand the meaning of precision and recall in Andrew’s rare disease example. I can read the formulas, but I don’t understand what each term means and what we want to compare within each term.

Precision

  • precision = take all patients that the model predicted as having the rare disease (ŷ=1; e.g. 15), which is a subset of the total number of actual rare disease patients (15+10), and compare it to the number of patients the model predicted as having the rare disease (15) plus the number of patients the model did not predict as healthy (5)
  • Precision is focused on the model’s performance I assume.

Recall

  • recall = take all patients that have the actual rare disease (y=1; e.g. 15+10) and compare it to …??? My number (15+10) doesn’t fit the formula’s numerator of 15.
  • Recall is focused more on the actual value I assume.
                             Actual class
                        ________1_________|________0_________
  Predicted class  1 | True Pos.  e.g. 15 | False Pos. e.g. 5  | <-- precision (this row)
                   0 | False Neg. e.g. 10 | True Neg.  e.g. 70 |
                                 ^
                                 |
                              recall (this column)

Can someone please explain the two terms to me based on the example with the rare disease patients? I need to understand it.

Hi @Daniel_Blossey,

For precision, we can compare the value in its numerator with the value in its denominator.

For recall, we can compare the value in its numerator with the value in its denominator.

We can compare a model’s precision with another model’s precision.

We can compare a model’s recall with another model’s recall.

In your example, for precision, we compare the number of correctly predicted positives (true positive) with the number of predicted positives (true positive + false positive), in other words, we are comparing 15 with (15 + 5) and find a ratio of 0.75.

In your example, for recall, we compare the number of correctly predicted positives (true positive) with the number of actual positives (true positive + false negative), in other words, we are comparing 15 with (15 + 10) and find a ratio of 0.60.
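
If it helps, here is a small cross-check of those two numbers in code (assuming scikit-learn is available; the label arrays below are just rebuilt from your counts):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Rebuild labels matching the confusion matrix: 15 TP, 5 FP, 10 FN, 70 TN
y_true = np.array([1] * 15 + [0] * 5 + [1] * 10 + [0] * 70)
y_pred = np.array([1] * 15 + [1] * 5 + [0] * 10 + [0] * 70)

print(precision_score(y_true, y_pred))  # 0.75
print(recall_score(y_true, y_pred))     # 0.6
```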

They are both performance metrics.

Their names do explain their purposes. I think an English dictionary can do a better job in explaining those two words. :wink:

We can talk about your aim. Aiming at higher precision tends to be more conservative, because you don’t want to risk having many false positives, for example, when you are a bank officer deciding who to lend money to during an economic depression. Aiming at higher recall tends to be more inclusive, because you don’t want to risk missing any positives, for example, when you are responsible for your company’s user retention in a very competitive and profitable market.
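
One way to see that trade-off concretely (my own sketch with made-up scores, not from the course): sweep the decision threshold of a classifier and watch precision and recall move in opposite directions.

```python
import numpy as np

# Made-up ground truth and predicted probabilities, sorted by score
y_true = np.array([1, 1, 1, 0, 1, 0, 0, 1, 0, 0])
scores = np.array([0.95, 0.90, 0.80, 0.70, 0.60, 0.55, 0.40, 0.35, 0.20, 0.10])

for t in [0.3, 0.5, 0.7]:
    y_pred = (scores >= t).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    print(f"threshold={t}: precision={tp / (tp + fp):.2f}, recall={tp / (tp + fn):.2f}")

# Raising the threshold (more conservative) improves precision but lowers recall;
# lowering it (more inclusive) does the opposite.
```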

Raymond

Thanks for your answer, Raymond! Your business case was helpful.

As a summary, for a model predicting a rare disease I would concentrate on recall = 100%, because I want all actual rare disease patients (15+10) to be predicted as 1. If the model is not so precise and some of its positive predictions actually have no rare disease (the 5 false positives), those patients will still survive physically, apart from the psychological stress of a false positive.

Definitions by Andrew from the video and conclusion/interpretation:

  • precision = of all the patients where we predicted ŷ=1 (15+5) → what fraction actually has the rare disease?
  • recall = of all the patients that actually have the rare disease (15+10) → what fraction did we correctly detect as having the rare disease y=1?

Thanks again! I like this course so much.

Hi @Daniel_Blossey,

Yes and yes for the two question marks.

Blindly assigning all patients as 1 guarantees 100% recall, but leaves the precision at its minimum value (it then simply equals the fraction of actual positives in the data). This is the worst “model”. We can also build a model that achieves better precision while maintaining 100% recall. If we are lucky enough to have some strong features, that precision can be much better; otherwise, we might end up with something not too different from the worst case, the blind assignment.
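
For concreteness, here is that worst “model” in code (a small sketch using the same 15 + 10 = 25 actual positives out of 100 patients):

```python
import numpy as np

y_true  = np.array([1] * 25 + [0] * 75)   # 25 actual positives out of 100 patients
y_blind = np.ones_like(y_true)            # blindly predict 1 for everyone

tp = np.sum((y_blind == 1) & (y_true == 1))   # 25
fp = np.sum((y_blind == 1) & (y_true == 0))   # 75
fn = np.sum((y_blind == 0) & (y_true == 1))   # 0

print("recall    =", tp / (tp + fn))   # 1.0  (no positive is missed)
print("precision =", tp / (tp + fp))   # 0.25 (just the base rate of the disease)
```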

Precision and recall are very relevant to information retrieval (IR). If you, by any chance, need to demonstrate them in another scenario, IR can be that.

Cheers,
Raymond