Hi, I am a bit confused about the yolo_non_max_suppression implementation in the YOLO programming assignment. In the function, it seems to me that non-max suppression does not care whether the bboxes belong to the same class. Instead, it simply sorts and prunes bboxes according to their scores. I wonder if I am understanding the implementation correctly. If so, why is it implemented this way, rather than doing non-max suppression per class? Thank you very much!
My take is that NMS assumes that if two bounding boxes are in the same place and have the same shape, they are the same object and you only need to keep one of them. If they are in the same place but have different shapes, they are not the same object and you keep them both. It's not clear to me how including the class prediction (note it’s just a prediction or confidence level, you don’t know for sure what the class is) improves or changes that result. If two bounding boxes are determined to be for the same object, they are by default for the same class, no?
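To make the "sort by score, then prune by overlap" idea concrete, here is a minimal pure-Python sketch of class-agnostic NMS. It is not the assignment's code; the function names, the `(x1, y1, x2, y2)` box format, and the 0.5 IoU threshold are all assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def class_agnostic_nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS that ignores class labels (hypothetical helper).

    Returns the indices of the kept boxes, highest score first.
    """
    # Visit boxes from the highest score to the lowest.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two near-duplicate tall boxes plus one box somewhere else (made-up numbers).
boxes = [(10, 10, 50, 90), (12, 12, 52, 92), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
print(class_agnostic_nms(boxes, scores))  # [0, 2]
```

Note that no class label appears anywhere in the function: the second box is dropped purely because it overlaps heavily with a higher-scoring box.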
Thanks! For example, suppose a dog is behind a person, and there are two tall bboxes at around the same place, one detecting the dog and one detecting the person. If NMS is performed regardless of the two classes, could the dog bbox be deleted?
The shortest answer is ‘Yes, that could happen.’ HOWEVER, this is where anchor boxes come into play. Anchor boxes are chosen based on common shapes in the training data. If you’re not familiar with how that is done, see the link below. Given their respective shapes, it is very likely that dogs and people are assigned to different anchor box shapes during training. It is therefore also likely that a dog and a person with centers in the same location in the image are each predicted separately at run time. Further, if the bounding box predictions are at all accurate, the boxes are not the same shape and have a low IOU. As a result, both would survive NMS. If a dog is sitting on the lap of a person sitting down, and the bounding boxes have almost the same location and shape, then only the one with the highest confidence score would be kept.
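A quick worked example of the IOU argument above, as a hedged sketch. The box sizes are made-up numbers, the helper names are hypothetical, and the 0.5 threshold mentioned in the comments is an assumption, not a value taken from the assignment.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def box_from_center(cx, cy, w, h):
    """Convert (center x, center y, width, height) to corner format."""
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


# A tall, narrow "person" box and a short, wide "dog" box with the same center.
person = box_from_center(100, 100, 40, 120)
dog_standing = box_from_center(100, 100, 100, 50)

# A "dog on lap" box with almost the same location and shape as the person.
dog_on_lap = box_from_center(102, 105, 40, 110)

print(round(iou(person, dog_standing), 3))  # 0.256, below 0.5: both survive
print(round(iou(person, dog_on_lap), 3))    # 0.833, above 0.5: lower score is dropped
```

Even with identical centers, the differently shaped boxes overlap far too little to trigger suppression; only the near-identical pair is at risk.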
The intent of grid cells is to allow detection of multiple objects in an image without running forward propagation more than once. The intent of anchor boxes is to allow detection of multiple objects at the same location of an image without running forward propagation more than once. It works well when the multiple objects are of different classes with different shapes: dog and person, person and car. But the model breaks down with multiple objects of the same size at the same location. Then, only the ‘best’ will survive the pruning. Hope this helps.
Related thread:
Another thing to remember is that YOLO is optimized for speed. It was designed to be competitive with the state of the art regarding accuracy, but to raise the bar on throughput so as to enable practical (near real-time) object detection. Would running NMS per class add some degree of accuracy? Perhaps. Would running NMS multiple times reduce throughput? Undoubtedly. There were 80 classes in the COCO data YOLO was initially developed on (hence the factor of 80 in the output shape). IIRC there are 1,000 classes in the current ImageNet data. That’s potentially a lot of additional computation to get through at 40 frames per second.
Thank you very much for your detailed explanation. This really helps a lot!!
I don’t think anchor boxes play a role in NMS, or do they? Even if the dog and the person fall in different anchor boxes, their bounding boxes may well have a high IOU, and the one with the lower score can get dropped and never recognized. That would not happen in a class-aware implementation of NMS.
Anchor boxes play an important role in accurate bounding box predictions. Accurate bounding box predictions play an important role in having high IOU only between multiple predictions of the same object in the image. So they have an indirect impact. But NMS is run only on predicted bounding boxes. By the way, don’t overlook that class is also a prediction and can be wrong. So those two highly co-located bounding boxes, one that is a dog and one that is a person? No guarantee there actually are two different types of objects there. YOLO designers made the simplifying assumption that if two or more bounding boxes are highly overlapping, then treat them as a single object and keep only the one with the highest confidence. In my opinion it is a reasonable choice. Could non-class-aware NMS ever get it wrong and drop an object it shouldn’t? Yes. Could class-aware NMS ever get it wrong and keep two bounding boxes for a single object? Yes. Which one is worse? Depends on the problem you’re using YOLO to address.
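The trade-off between the two variants can be sketched in a few lines of pure Python. Everything here is illustrative: the function names are hypothetical, the boxes, scores, and class labels are made up, and the 0.5 IoU threshold is an assumption.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy class-agnostic NMS; returns kept indices, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


def class_aware_nms(boxes, scores, classes, iou_threshold=0.5):
    """Run NMS separately within each predicted class (hypothetical helper)."""
    keep = []
    for c in sorted(set(classes)):
        idx = [i for i, ci in enumerate(classes) if ci == c]
        local = nms([boxes[i] for i in idx], [scores[i] for i in idx],
                    iou_threshold)
        keep.extend(idx[j] for j in local)  # map back to original indices
    return sorted(keep, key=lambda i: scores[i], reverse=True)


# Three heavily overlapping boxes: two "person" predictions and one "dog".
boxes = [(80, 40, 120, 160), (82, 50, 122, 160), (81, 42, 121, 158)]
scores = [0.90, 0.75, 0.60]
classes = ["person", "dog", "person"]

print(nms(boxes, scores))                       # [0]
print(class_aware_nms(boxes, scores, classes))  # [0, 1]
```

The class-agnostic version keeps a single box, so a real dog at that spot would be lost. The class-aware version keeps the dog box, which is right if there really is a dog and wrong (two boxes for one object) if "dog" was just a misclassified second prediction of the person.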
Yeah, I kind of get what you are saying: this is all probabilistic estimation in the first place, and you have to make a trade-off somewhere. I was reading a few blogs, and most of them do not bother with class-specific NMS.