C1W2 Ungraded Lab: misclassification calculation

In C1W2_Ungraded_Lab_Birds_Cats_Dogs.ipynb, in the cell right below the confusion matrix, I don’t think the misclassification rates are computed the right way.

Let’s take the misclassification rate of birds as an example. In my view, we submit birds to the model (one by one) and count the wrong predictions (i.e., real birds predicted as cats or dogs). The number of wrong predictions (numerator) divided by the total number of birds submitted for classification (denominator) should be the misclassification rate of birds, IMHO.

In Python:

((y_true == 0) & ((y_pred_imbalanced == 2) | (y_pred_imbalanced == 1))).sum() / (y_true == 0).sum()

The formula in the notebook instead considers all predictions of birds and counts how many times the ground truth was a cat or a dog rather than a bird.

In Python:

((y_pred_imbalanced == 0) & ((y_true == 2) | (y_true == 1))).sum() / (y_pred_imbalanced == 0).sum()

Discussion: In my view, the misclassification rate is the false negative rate (FNR) of each class. The formula in the notebook calculates the false discovery rate (FDR) instead.
Regarding the terminology, see Wikipedia.
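To make the distinction concrete, here is a minimal sketch (my own illustration, not from the notebook) using a made-up 3x3 confusion matrix with rows as true labels and columns as predicted labels, which is how scikit-learn builds it:

import numpy as np

# Toy confusion matrix (illustrative numbers): rows = true class, columns = predicted class
cm = np.array([[3, 0, 1],
               [0, 1, 3],
               [1, 1, 3]])

row_totals = cm.sum(axis=1)   # number of true instances of each class
col_totals = cm.sum(axis=0)   # number of predictions of each class
correct = np.diag(cm)         # correctly classified instances per class

fnr = (row_totals - correct) / row_totals   # per-class false negative rate (what I call the misclassification rate)
fdr = (col_totals - correct) / col_totals   # per-class false discovery rate (what the notebook cell computes)

print("FNR per class:", fnr)
print("FDR per class:", fdr)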

Your view, guys?

Thanks for pointing this out. I’ve asked the staff to fix it.

Please use this snippet:

# Row i of the confusion matrix holds the true class i, so the off-diagonal entries
# of each row divided by the row total give the per-class misclassification rate
misclassified_birds = (imbalanced_cm[0, 1] + imbalanced_cm[0, 2])/np.sum(imbalanced_cm, axis=1)[0]
misclassified_cats = (imbalanced_cm[1, 0] + imbalanced_cm[1, 2])/np.sum(imbalanced_cm, axis=1)[1]
misclassified_dogs = (imbalanced_cm[2, 0] + imbalanced_cm[2, 1])/np.sum(imbalanced_cm, axis=1)[2]

print(f"Proportion of misclassified birds: {misclassified_birds*100:.2f}%")
print(f"Proportion of misclassified cats: {misclassified_cats*100:.2f}%")
print(f"Proportion of misclassified dogs: {misclassified_dogs*100:.2f}%")

Notebook has been updated, thanks to @zbynekb for flagging and @balaji.ambresh for coming up with the solution :slight_smile:

And doesn’t the code for the confusion matrix also have to be changed? The confusion matrix image seems inconsistent with the new snippet above.

It would seem to me that the arguments in the code:

imbalanced_cm = confusion_matrix(y_true, y_pred_imbalanced)

should be reversed in order to align:

imbalanced_cm = confusion_matrix(y_pred_imbalanced, y_true)

The code provided in the notebook is correct according to the latest scikit-learn API, version 1.5.1 (see confusion_matrix): confusion_matrix(y_true, y_pred) places the true labels along the rows and the predicted labels along the columns, so it already lines up with the row-wise snippet above and the arguments should not be reversed.

Here’s an example:

import numpy as np
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

class_labels = ['birds', 'cats', 'dogs']
y_true = ["dogs", "dogs", "dogs", "dogs", "dogs",
          "cats", "cats", "cats", "cats", 
          "birds", "birds", "birds", "birds"]
y_pred = ["dogs", "dogs", "cats", "birds", "dogs",
         "cats", "dogs", "dogs", "dogs",
         "birds", "birds", "birds", "dogs"]

imbalanced_cm = confusion_matrix(y_true, y_pred, labels=class_labels)
cmd = ConfusionMatrixDisplay(imbalanced_cm, display_labels=class_labels)
cmd.plot()

# from notebook
misclassified_birds = (imbalanced_cm[0, 1] + imbalanced_cm[0, 2])/np.sum(imbalanced_cm, axis=1)[0]
misclassified_cats = (imbalanced_cm[1, 0] + imbalanced_cm[1, 2])/np.sum(imbalanced_cm, axis=1)[1]
misclassified_dogs = (imbalanced_cm[2, 0] + imbalanced_cm[2, 1])/np.sum(imbalanced_cm, axis=1)[2]

print(f"Proportion of misclassified birds: {misclassified_birds*100:.2f}%")
print(f"Proportion of misclassified cats: {misclassified_cats*100:.2f}%")
print(f"Proportion of misclassified dogs: {misclassified_dogs*100:.2f}%")

Proportion of misclassified birds: 25.00%
Proportion of misclassified cats: 75.00%
Proportion of misclassified dogs: 40.00%

# I prefer this vectorized version
total_instances = imbalanced_cm.sum(axis=1)        # number of true instances per class (row sums)
correct_classifications = np.diag(imbalanced_cm)   # correctly classified instances per class
misclassifications = total_instances - correct_classifications
print(misclassifications * 100 / total_instances)  # per-class misclassification rate in %

[25. 75. 40.]
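Another option, if you prefer (just a thought, not something the notebook needs): confusion_matrix also accepts normalize='true', which divides each row by its total, so one minus the diagonal gives the same per-class rates directly. Using the y_true, y_pred and class_labels from the example above:

# normalize='true' makes each row sum to 1, so the diagonal holds per-class recall
normalized_cm = confusion_matrix(y_true, y_pred, labels=class_labels, normalize='true')
print((1 - np.diag(normalized_cm)) * 100)  # [25. 75. 40.]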

Am I missing something?

This example was very helpful. The issue was on my end: I misinterpreted the phrase “Proportion of misclassified birds” (etc.) to mean the proportion of animals misclassified as birds, rather than the proportion of birds misclassified as other animals. As a result, I was reading off the confusion matrix columns rather than the rows.
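For anyone who trips over the same thing, here is a quick sketch (my own addition) of what the column-wise reading computes on the toy imbalanced_cm from the example above: the false discovery rate per predicted class, which generally differs from the row-wise rates (it only happens to coincide for birds in this example).

# Column j holds everything the model predicted as class j, so this is the
# proportion of "bird"/"cat"/"dog" predictions that were wrong (FDR),
# not the proportion of actual birds/cats/dogs that were misclassified (FNR)
col_totals = imbalanced_cm.sum(axis=0)
fdr_per_class = (col_totals - np.diag(imbalanced_cm)) * 100 / col_totals
print(fdr_per_class)  # approximately [25. 50. 57.14] for the toy example above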