Course 4, Week 3, Assignment 1 (YOLO)

Hello, everyone

I have a question about the function called yolo_non_max_suppression.

Please correct me if I am wrong, but it seems like this function does non-max suppression over the entire input tensor regardless of the class of each box. In the lecture, Andrew said it's best to run suppression for each class separately.

Any ideas?

Hi @Omar-Sayan_Karabayev,

I believe your understanding here is correct. This has been brought to our attention before as well; we are aware of it and working on ways to improve the assignment.

Cheers,
Mubsi

I’m not clear on what it is we think needs improvement. Here is how the TensorFlow doc starts the description of NMS: "Prunes away boxes that have high intersection-over-union (IOU) overlap with previously selected boxes."

Consider the case in the limit where IOU is 1. That means two detectors agree exactly on the object location: they produced the same predicted bounding box. As is, the class prediction is ignored and only the highest confidence prediction is retained. However, if each has a different class prediction and you are running NMS separately per class, then both predictions are kept. Notice only one of these predictions can be correct; we can’t have two different types of objects occupying exactly the same pixels of an image. But we can’t really call it non-max-suppression if we are not suppressing the inferior confidence predictions, right? To disambiguate these two predictions that have survived the pipeline, a further processing step will be required.

I think it is a legitimate question whether NMS overall improves accuracy and precision, but NMS run on classes separately is an oxymoron.
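To make the limiting case concrete, here is a small sketch (not code from the assignment) showing class-agnostic NMS behavior with `tf.image.non_max_suppression` when two detectors produce the identical box for different classes:

```python
import tensorflow as tf

# Hypothetical example: two detectors predict the exact same box
# (IOU = 1) but with different class labels and confidences.
boxes = tf.constant([
    [0.1, 0.1, 0.5, 0.5],   # e.g. predicted as "person", score 0.9
    [0.1, 0.1, 0.5, 0.5],   # same pixels, e.g. predicted as "frisbee", score 0.8
])
scores = tf.constant([0.9, 0.8])

# Class-agnostic NMS: boxes are considered together, so only the
# highest-confidence prediction survives; the class is ignored.
keep = tf.image.non_max_suppression(
    boxes, scores, max_output_size=10, iou_threshold=0.5)

print(keep.numpy())  # [0] -- the lower-scoring duplicate is suppressed
```

Running NMS per class instead would keep both boxes here, since each class group contains only one box and nothing gets suppressed.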

Please look at the original image on the right in this link. Here are the outputs without and with class-wise NMS.

Without non-max suppression per class
without_classwise_nms

With non-max suppression per class
with_classwise_nms

The 2nd approach is better if we consider a custom YOLO model developed for detecting whether someone is carrying a prohibited item in an airport. If someone were to hold a disallowed object in front of their body, the 1st approach would fail to flag the person.

Ok, there are some operational environments / requirements that may drive engineering choices other than vanilla NMS. Is that an “improvement” of NMS that needs to be introduced into the YOLO learning exercise? Not clear to me. BTW, it looks like that YOLO model produced some pretty bad localization errors. Maybe better training data and more training would solve the problem, because then the person and ‘frisbee’ bounding boxes would have low IOU in the first place.

You are right to point out that this YOLO model wasn’t fine-tuned.
I used the pre-trained YOLO model to get these results. The settings (including the minimum score threshold) were left as they were in the starter code; no fine-tuning was done for any specific dataset.

The version that does NMS per class requires changes to the yolo_non_max_suppression function within the assignment: instead of a single NMS invocation across all detected boxes, NMS is invoked once per class.

NMS run separately per class trades one problem (possible suppression of true positives) for another (duplicate predictions). I think the important message for learners is that neither approach is without risks and drawbacks, some of which can be at least reduced, if not completely mitigated, by targeted training. In the provided example, if occluded objects are an important use case, effort should be made to include them in the training data. In every case, pick the algorithm, training data, and training regime that empirically work best, monitor performance, and be prepared to adjust as the situation demands.

The tradeoff between compute cost and tolerance for false positives / false negatives should determine the NMS implementation.

Covering the per-class implementation of NMS as part of the assignment is good, since it’s trivial to convert that code back to class-agnostic NMS.