During the presentation, Andrew recommended also reviewing some of the cases that the algorithm got right:

“Consider examining examples your algorithm got right as well as ones it got wrong.”

He also mentioned, though, that this set of examples can be large, for example if the actual accuracy is 98%.

One suggestion that can be useful (and something that I have done in the past) is to look not only at the final prediction (1 or 0) but also at the final probability or activation that produced it, and to focus the review on the cases where the algorithm was uncertain but still got the right answer. For example, it predicted “1” correctly, but only with a 51% chance for “1” against a 49% chance for “0”. The same applies to wrong answers that were predicted with a similarly narrow margin.
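As a rough illustration of the idea above, here is a minimal sketch that picks out correct predictions whose probability sits close to 0.5. The function name `uncertain_but_correct`, the `band` width of 0.1, and the sample numbers are all made up for illustration; it assumes you already have the probability of class “1” for each example.

```python
def uncertain_but_correct(probs, labels, band=0.1):
    """Return indices of correct predictions whose P(class 1) is within `band` of 0.5."""
    picked = []
    for i, (p, y) in enumerate(zip(probs, labels)):
        pred = 1 if p >= 0.5 else 0  # usual 0.5 cut-off for a binary problem
        if pred == y and abs(p - 0.5) <= band:
            picked.append(i)
    # Put the most uncertain cases first so they get reviewed first.
    picked.sort(key=lambda i: abs(probs[i] - 0.5))
    return picked


# Hypothetical model outputs and ground-truth labels.
probs = [0.51, 0.98, 0.45, 0.70, 0.58]
labels = [1, 1, 0, 1, 0]
print(uncertain_but_correct(probs, labels))
```

Flipping `pred == y` to `pred != y` would give the narrow-margin wrong answers instead.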

This approach is also interesting at the other extreme, i.e. for the cases where the algorithm gave, for example, a 98% probability to a prediction that turned out to be wrong. These cases may be of particular importance for review.

Yes, you can learn a lot by looking at different types of cases. Great share! Thx.

Does this suggest that maybe the network should be architected to output the probability, rather than the binary outcome? You can always perform the clipping on the probability in post-network processing, but once you output the 1/0 that detail is lost.

I believe he looks at the final activation of the network. I guess he has an argmax function in his predict method. I have never seen a classifier with clipped outputs in the cost function. So the network is not outputting clipped values, but probabilities already:)

Sorry, I didn’t mean to imply clipping in the cost function. I inferred that the network final output was binary, and that backing up to see the probability that resulted in that output required extra work/re-engineering. My question was whether to always directly output the confidence from the network, and produce 1/0 or 1/0/not-very-confident-could-be-either in post network processing.
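To make the question concrete, here is a minimal sketch of that kind of post-network processing, assuming the network hands back the probability of class “1”. The function name `decide` and the band edges 0.4 and 0.6 are hypothetical choices, not anything from the course.

```python
def decide(p_one, low=0.4, high=0.6):
    """Map P(class 1) to 1, 0, or "uncertain" using a middle band."""
    if p_one >= high:
        return 1
    if p_one <= low:
        return 0
    return "uncertain"  # not confident enough, could be either


for p in (0.98, 0.51, 0.03):
    print(p, decide(p))
```

Because the probability is kept until this last step, the band can be tuned later without touching the network.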

Maybe we are talking about the same thing. I will try again:

The network's final output is what I was referring to, and it is most likely not binary. Depending on the final activation function, it is either in ]0,1[ (with a sigmoid, for example), or a real value if you have the identity function, i.e., what we call logits, or in [0, infinity[ if you use ReLU, for example.

The final output is then used to perform a prediction. You can apply argmax or some other method for deciding your prediction based on this real value.

Given that there was some uncertainty about my post, I will try to express what I was trying to say in a more explicit way.

If we consider a binary problem and softmax as the final layer, the model prediction is the label with the maximum softmax score, so either 0 or 1.

However, the 1 prediction, for example, could occur when the 1 label had a softmax score of 0.9 or of 0.51.
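A small sketch of this point, using only the standard library: two made-up pairs of logits both lead to the prediction 1, but with very different softmax scores. The helper names `softmax` and `predict` and the logit values are assumptions chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits):
    """Return (predicted label, softmax score of that label)."""
    scores = softmax(logits)
    label = max(range(len(scores)), key=scores.__getitem__)  # argmax
    return label, scores[label]


confident = predict([0.0, 2.2])    # label 1 with a score of roughly 0.90
borderline = predict([0.0, 0.04])  # label 1 with a score of roughly 0.51
print(confident, borderline)
```

Keeping the score next to the label is what makes the difference between these two cases visible during review.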

So, when manually reviewing the modeling results (whether to check for mislabeled data or to find ideas for improving the model), it can sometimes be more useful to look at the softmax score values, as these may contain more information for these tasks than the final model predictions alone.

And this can be especially useful when there is a lot of data to review.

Focusing the manual review on the errors where the score for the wrongly predicted label was greater than, say, 0.9 may be more efficient than just taking a random sample of all of the wrong predictions.

This subset of the input data may contain a higher share of the mislabeled cases. Additionally, if the labeling was actually correct, then these cases may provide particular insights into the areas to focus on for model improvements.
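The filtering described above could look roughly like this. It is only a sketch: the function name `confident_errors`, the 0.9 threshold, and the sample scores are made up, and it assumes each example comes with its full list of softmax scores.

```python
def confident_errors(score_rows, labels, threshold=0.9):
    """Return (index, predicted label, score) for wrong predictions
    whose winning softmax score exceeds `threshold`, highest score first."""
    flagged = []
    for i, (scores, y) in enumerate(zip(score_rows, labels)):
        pred = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if pred != y and scores[pred] > threshold:
            flagged.append((i, pred, scores[pred]))
    flagged.sort(key=lambda item: item[2], reverse=True)
    return flagged


# Hypothetical softmax scores per example and the recorded labels.
score_rows = [[0.05, 0.95], [0.40, 0.60], [0.98, 0.02], [0.10, 0.90]]
labels = [0, 0, 1, 1]
for index, pred, score in confident_errors(score_rows, labels):
    print(f"example {index}: predicted {pred} with score {score:.2f}")
```

Each flagged example is then either a candidate mislabel or a case where the model is confidently wrong, which is the subset worth reading first.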

It is never wrong to clarify. Thx for sharing your thoughts.