The non-max suppression step should be applied independently for each selected class in order to use the value of the anchor boxes. However, it isn’t.
It appears that TF automatically handles this inside the non_max_suppression() function.
- Anchor boxes have shape only, not type (class).
- NMS performs IOU on two predicted bounding boxes at a time and doesn’t involve anchor boxes at all. In fact, since it is a post processing step, the grid cell and anchor box indices associated with each prediction are already gone by the time the list of predicted bounding boxes is passed to NMS. You just have the coordinates and the confidence to work with.
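To make the point above concrete, here is a minimal greedy NMS sketch (plain Python, illustration only, not the assignment’s code). Note that the only inputs are box coordinates and confidence scores — no grid cells, anchor boxes, or class labels appear anywhere:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop overlapping rivals, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

For example, `nms([(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)], [0.9, 0.8, 0.7])` returns `[0, 2]`: the second box overlaps the first too much and is pruned, purely on coordinates and confidence.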
I had the same question (though I needed a few more words for it): Course 4, week 3, programming assignment 1: non-max suppression and multiple classes
the grid cell and anchor box indices associated with each prediction are already gone by the time the list of predicted bounding boxes is passed to NMS. You just have the coordinates and the confidence to work with
Are you sure about this? In the function we implemented, yolo_non_max_suppression(), all information needed to calculate the NMS per class is still there (in the tensors boxes, scores, and classes). It is true though that the class information is not passed on to tf.image.non_max_suppression() - but we still had that information ready. So presumably, the TF implementation of NMS used here actually does not do the NMS per class.
It appears that TF automatically handles this inside the non_max_suppression() function.
I wonder how, then. The class information, which is available, is not given to the TF implementation of NMS (tf.image.non_max_suppression()). Or do you mean that classes are just ignored?
@Reinier_de_Valk the original post includes this: “in order to use the value of anchor boxes”. My assertion was that anchor boxes play no role in NMS - it is based purely on bounding box predictions. I stand by that. I concede that grid cells and anchor boxes could be reverse engineered from the image-relative coordinates by repeating the type of ‘best anchor’ computation done when setting up training data, so in some sense they are available. But they don’t play a role in NMS, and the extra computation would degrade throughput.
The drivers of NMS are predicted location and shape, not type. If two object predictions have the same location and the same shape, the algorithm assumes they are duplicates. In the limit where IOU is 1, the algorithm is finding that the two predictions are superimposed - they are literally the same pixels. It doesn’t make sense to keep two object predictions that are composed of the same image data, regardless of whether or not the class predictions are identical.
Another similar thread with more words here… Non-max suppression C4W3 assignment (car detection with YOLO)
One last thought is to remember that this is just engineering, which is all about making acceptable tradeoffs to achieve desired outcomes. Ignoring predicted class in post processing optimizes for throughput and accepts possible loss of accuracy to attain it. If your business case demands differently, or your throughput or confusion matrix isn’t where you need them, you can always make different tradeoffs. Cheers.
Thanks @ai_curious for the extensive replies!
My assertion was that anchor boxes play no role in NMS - it is based purely on bounding box predictions.
You are absolutely right, I confused anchor boxes and bounding boxes (again). Thanks for pointing that out.
It doesn’t make sense to keep two object predictions that are composed of the same image data, regardless of whether or not the class predictions are identical.
Yes, this makes total sense.
I am still a little bit confused, but I guess I will just let all this information sink in for a while and read a bit more about it.
Thanks again for taking the time to answer!
Hello
my question will be about the process of non-max suppression and its implementation by the method tf.image.non_max_suppression().
As far as I understood, we need to perform the maximum-overlap filtering, i.e. IOU filtering, independently for the remaining boxes of each class after the score-filtering process, and the IOU filtering goes as follows: find the box with the max score for each class and ignore the boxes with IOU above the iou_threshold.
However, it seems to me that the classes are not considered in the function tf.image.non_max_suppression(), since it does not take the classes as an argument, only the boxes and scores. So my question is: how does this function know about or perform the class-wise elimination of boxes with high IOU?
thanks
It doesn’t. It is not class aware. Two objects with high IOU are deemed to be duplicates, regardless of whether or not the associated class predictions are the same.
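You can verify this in a few lines (assuming TensorFlow is installed). Two superimposed boxes survive as one, even though we, the caller, know they carry different class predictions - that information is simply never passed in:

```python
import tensorflow as tf

boxes = tf.constant([[0.0, 0.0, 1.0, 1.0],   # predicted class: car
                     [0.0, 0.0, 1.0, 1.0]])  # predicted class: pedestrian (same pixels)
scores = tf.constant([0.9, 0.8])
# The classes are known to us, but they are NOT an argument to the function:
kept = tf.image.non_max_suppression(boxes, scores, max_output_size=10,
                                    iou_threshold=0.5)
# kept contains only index 0: the lower-scoring duplicate is pruned,
# regardless of its (different) class prediction.
```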
EDIT
I don’t think there is any ambiguity about what the current implementation does. However, there is some lack of consensus among members of this community about what it should do. You can find other threads where people advocate for per-class NMS. I remain unconvinced. And if it was good enough for Redmon et al, it’s good enough for me.
Can the non-class-aware implementation make some errors? Yes, it can prune objects that should be kept. The class-aware approach can make errors too, and it leaves you with no tool to disambiguate or resolve duplicates. I think there is a legitimate engineering tradeoff around which problem is more important in your situation. I do not believe it is legitimate to flatly say that per-class NMS is better or ‘correct’ and always preferred. My recommendation is to train to get good localization and classification predictions, and then tune the IOU and confidence thresholds to reduce the likelihood of over-aggressive pruning in NMS. Not everyone agrees. Cheers.
I disagree. Why would per-class NMS not be better than non-class-aware NMS in almost all cases? At the cost of a few extra steps (which is not much extra computation), we eliminate the risk - even if it has low probability - of, say, a pedestrian being dropped from recognition when their center falls in the same grid cell as, say, a bicycle.
And if it was good enough for Redmon et al, it’s good enough for me
. Well, do we know for sure that in their research paper Redmon et al ignore the classes during NMS? TensorFlow provides a non-class-aware implementation of NMS, but you can always make it class aware by calling it for a set of boxes for each class independently.
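To illustrate the suggestion above, here is one way it could be done (a sketch, assuming TensorFlow; per_class_nms is a hypothetical helper, not part of the assignment or the TF API): call tf.image.non_max_suppression once per class, so boxes of different classes can never suppress each other.

```python
import tensorflow as tf

def per_class_nms(boxes, scores, classes, max_output_size=10, iou_threshold=0.5):
    """Class-aware NMS built on TF's class-agnostic primitive (illustrative sketch).
    boxes: [N, 4], scores: [N], classes: [N] integer class ids.
    Returns indices (into the original tensors) of the kept boxes."""
    keep = []
    for c in tf.unique(classes).y:
        # Rows belonging to this class only.
        idx = tf.reshape(tf.where(classes == c), [-1])
        # Run the standard (class-agnostic) NMS on just this class's boxes.
        sel = tf.image.non_max_suppression(tf.gather(boxes, idx),
                                           tf.gather(scores, idx),
                                           max_output_size, iou_threshold)
        # Map the per-class indices back to indices into the full tensors.
        keep.append(tf.gather(idx, tf.cast(sel, idx.dtype)))
    return tf.concat(keep, axis=0)
```

With this wrapper, two superimposed boxes of different classes are both kept - which is exactly the behavior being debated in this thread, with the extra per-class looping as the cost.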
hey @paulinpaloalto, any thoughts on this thread about the programming exercise implementing NMS in a non-class-aware manner? I understand that the TF NMS method is not class aware, but in my opinion we should have called it for a set of boxes for each class independently.
YOLO was designed primarily to address the need to improve throughput. The designers accepted some reduced accuracy in the pursuit of highest frame rate. They didn’t choose to add extra computation to the pipeline to protect against loss of accuracy in some edge cases. Is that the right choice for every conceivable object detection scenario? Perhaps not. Is it the right choice for the majority of use cases? The authors of YOLO thought so. As did the authors of the TensorFlow library. I’m inclined to trust their judgment. But if you need class-aware NMS for your application, go right ahead and do what the situation calls for.
Sure, I understand now - thanks for replying!