While reading about evaluation metrics for imbalanced classification problems, I learned that the metrics can be broadly classified into 3 categories - threshold metrics, ranking metrics, and probability metrics. The “threshold metrics” category consists of accuracy, error, precision, recall, sensitivity, etc. One of the documents mentioned that “threshold metrics assume full knowledge of the conditions under which the classifier will be deployed. In particular, they assume that the class imbalance present in the training set is the one that will be encountered throughout the operating life of the classifier, which is not often the case, so they can mislead you”.
I don’t understand how metrics such as recall can mislead. Whatever the ratio of positive to negative examples, these metrics would still give a measure of how accurately the positive and negative categories were predicted.
Hello @Harshit1097 ,
One thing to note first: recall on its own measures only how well the positive class is predicted. It is calculated as the number of true positives divided by the total number of actual positive examples (true positives plus false negatives). That is exactly why it can be misleading on an imbalanced classification problem, where there are many more negative examples than positive ones. A classifier can achieve perfect recall simply by predicting every example as positive. For example, if the positive class makes up only 1% of the training data, a classifier that labels everything as positive will have a recall of 100%. However, its precision, which is the number of true positives divided by the total number of predicted positives, will be only about 1%, because almost every one of its positive predictions is a false positive.
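Here is a minimal sketch of that degenerate case (this uses scikit-learn and made-up numbers, which are illustrative assumptions rather than anything from the course material):

```python
# Illustrative sketch: a classifier that always predicts "positive" on a
# dataset where only ~1% of examples are actually positive.
import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.01).astype(int)  # ~1% positives
y_pred = np.ones_like(y_true)                     # always predict "positive"

print("recall:   ", recall_score(y_true, y_pred))     # 1.0 -- every actual positive is caught
print("precision:", precision_score(y_true, y_pred))  # ~0.01 -- nearly all predictions are false positives
```

The recall looks perfect, yet the model is useless, which is why recall by itself does not tell the whole story.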
It is important to use a variety of metrics to evaluate the performance of a classifier on imbalanced classification problems. In addition to recall, you should also consider precision, the F1 score, and ROC AUC. Together, these metrics give a more complete picture of your classifier’s performance.
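If it helps, here is a hedged sketch of computing several of these metrics at once for a single model; the synthetic dataset, the logistic regression model, and all the numbers are assumptions made purely for illustration:

```python
# Illustrative sketch: evaluate one model with several metrics on a
# synthetic imbalanced dataset (~1% positives).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, weights=[0.99, 0.01], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)                 # hard labels for threshold metrics
y_prob = clf.predict_proba(X_test)[:, 1]     # scores for the ranking metric

print("precision:", precision_score(y_test, y_pred, zero_division=0))
print("recall:   ", recall_score(y_test, y_pred))
print("F1:       ", f1_score(y_test, y_pred))
print("ROC AUC:  ", roc_auc_score(y_test, y_prob))  # threshold-free ranking metric
```

Looking at these side by side makes it much harder for a degenerate or weak classifier to look good than any single threshold metric would.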
Please feel free to reply with a follow up if you have any further questions.
In situations like this I always like to consider the extremes. Suppose your training set was 100% pictures of cats, but the operational environment was 100% pictures of motorcycles. The classifier will learn to always predict cat, and during training it will always be correct. Operationally it will also always predict cat, but now it will always be wrong. From this I get the intuition that the training metrics can mislead me about the usefulness of my model when faced with class imbalance. Too lazy to try, but I’m pretty confident the algebra would back this up. Probably a Bayesian proof in there somewhere.
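For anyone who wants to see the algebra spelled out, here is a toy sketch of that extreme (everything in it is illustrative, not a formal proof): a model that always predicts the majority training class is right exactly as often as that class appears, so its accuracy tracks the class mix it happens to be scored on.

```python
# Toy illustration of the "extremes" argument: an always-predict-cat model
# scored under the training class mix and under shifted deployment mixes.
def accuracy_of_always_predict_cat(fraction_cats: float) -> float:
    # The model always says "cat", so it is correct exactly as often as cats appear.
    return fraction_cats

print("training   (100% cats):       ", accuracy_of_always_predict_cat(1.0))  # 1.0
print("deployment (100% motorcycles):", accuracy_of_always_predict_cat(0.0))  # 0.0
print("deployment (50/50 mix):       ", accuracy_of_always_predict_cat(0.5))  # 0.5
```

Same model, same predictions, wildly different accuracy, which is exactly the sense in which threshold metrics assume the deployment class distribution matches the training one.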