Course 4, week 3, programming assignment 1: non-max suppression and multiple classes

Hi,

There is something that I don’t understand about the implementation of the yolo_non_max_suppression() function in course 4, week 3’s first programming assignment.

This function takes as input the output from yolo_filter_boxes(), i.e., three tensors that represent all boxes that remain after the box-score filtering that happens in yolo_filter_boxes():

  • boxes: a tensor of shape (n, 4) containing the box coordinates
  • scores: a tensor of shape (n,), containing the class probability scores
  • classes: a tensor of shape (n,), containing the class indices

So here we have all the boxes that pass the filtering threshold, as well as their scores and the classes they belong to.

Now, here is what I don’t understand. If this information is passed to yolo_non_max_suppression(), how is the NMS algorithm going to distinguish between boxes belonging to different classes?

I understand how the elimination works if there is only one class to detect, but what if there are multiple, as is the case here? In the lecture videos, Andrew says that in the case of multiple classes, you have to independently carry out NMS one time on each of the classes - which makes sense. So I assume one would run NMS as many times as there are classes, each time applying it to the boxes of one class only. We have the classes tensor, so we have the class information for the boxes.

However, I don’t see how tf.image.non_max_suppression() does this. It does not take any class information as input, and I don’t see how it would manage to ‘ignore’ boxes for different classes that have a high IoU.

An example to make it extra clear: suppose we have two boxes, both belonging to the same class, and with a very high IoU. In this case, the lower-score box is removed. But if these two boxes did not belong to the same class, they should both be kept (as I understand it, this pair should not even be compared by the NMS algorithm). How is this achieved?
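To make the two-box example concrete, here is a minimal pure-Python sketch of greedy, class-agnostic NMS. It is not the TensorFlow implementation: the helper names iou() and greedy_nms(), the (y1, x1, y2, x2) box layout, and the sample boxes are all assumptions made up for illustration. It only shows that an NMS routine which never receives the class labels will drop the lower-score box even when the two boxes belong to different classes.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (y1, x1, y2, x2)."""
    inter_h = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_w = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_h * inter_w
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Class-agnostic greedy NMS; returns kept indices, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two heavily overlapping boxes (hypothetical values).
boxes = [(0.0, 0.0, 10.0, 10.0), (0.0, 1.0, 10.0, 11.0)]
scores = [0.9, 0.8]
classes = [0, 1]  # different classes, but greedy_nms never sees this list

overlap = iou(boxes[0], boxes[1])
kept = greedy_nms(boxes, scores)
print(overlap, kept, [classes[i] for i in kept])
```

Here the class-1 box is suppressed by the class-0 box, which is exactly the behaviour I am asking about.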

I hope I made myself clear - any insights are greatly appreciated!

Have you seen this?

YOLO has different versions with different implementations.

The Keras version of YOLO v3 has a function with the same name as in this exercise, yolo_eval(), but it calls tf.image.non_max_suppression() once per class. So tf.image.non_max_suppression() itself is not class-aware; yolo_eval() takes care of that.
More recent implementations use multi-class non-max suppression instead, i.e. tf.image.combined_non_max_suppression().

This exercise is based on a slightly older version of YOLO, so the code mentioned above may not be present in it.

@balaji.ambresh I had not found this one in my search, thank you for pointing me to it!

@anon57530071 Aha… Let me see if I get this right. So you are saying that

  • In our (example) programming exercise, which seems to be a bit simplified, we do indeed not carry out NMS independently for the classes - but in the real world, when using Keras’s yolo_eval(), NMS is actually carried out independently.
  • In both our example scenario and the real-world one, tf.image.non_max_suppression() is not class-aware (meaning that I understood this correctly in my initial question).

Thanks!

First of all, we are not discussing real-world implementations at all. That’s an important point.

In a real-world implementation there are many stakeholders, and architecture decisions can differ depending on objectives, business needs, time to market, resource constraints, latency requirements, and so on.
If we think about driving a car, the most important thing is whether something is a “drivable path” or not. “Cat”, “Dog”, “Person”, … are all “non-drivable path”. So using single-class NMS, as in this exercise, may be one of the options.

Then, let’s go back to your question. Seeing is the fastest way to learn. :slight_smile:
Here is the relevant part of the YOLO v3 Keras implementation. (Again, this is just one implementation in the long history of the YOLO family.)

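    # Excerpt from yolo_eval(): NMS is run once per class.
    for c in range(num_classes):
        # Keep only the boxes (and their class-c scores) selected by mask[:, c].
        class_boxes = tf.boolean_mask(boxes, mask[:, c])
        class_box_scores = tf.boolean_mask(box_scores[:, c], mask[:, c])
        # The class-agnostic NMS call only ever sees the boxes of class c.
        nms_index = tf.image.non_max_suppression(
            class_boxes, class_box_scores, max_boxes_tensor, iou_threshold=iou_threshold)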

As you can see, tf.image.non_max_suppression() is not expected to handle object classes here; the loop around it does that.
I have my own implementation of YOLO, and it uses tf.image.combined_non_max_suppression(), just like recent versions of YOLO, for more efficient bounding-box handling.
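If it helps, the same per-class idea can be sketched without TensorFlow. This is only an illustration under my own assumptions: iou(), greedy_nms() and per_class_nms() are hypothetical helpers, boxes are (y1, x1, y2, x2) tuples, and each box carries a single class index as in the exercise (the Keras code above instead loops over per-class score columns). The point is just that a class-agnostic NMS becomes class-aware once you call it separately on each class's boxes.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (y1, x1, y2, x2)."""
    inter_h = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_w = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_h * inter_w
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Class-agnostic greedy NMS; returns kept indices, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


def per_class_nms(boxes, scores, classes, iou_threshold=0.5):
    """Run greedy_nms once per class so boxes only compete within their class."""
    keep = []
    for c in sorted(set(classes)):
        # Indices (into the full lists) of the boxes that belong to class c.
        idx = [i for i, cls in enumerate(classes) if cls == c]
        kept_local = greedy_nms(
            [boxes[i] for i in idx], [scores[i] for i in idx], iou_threshold)
        # Map positions within the class subset back to the original indices.
        keep.extend(idx[j] for j in kept_local)
    return sorted(keep)


# Three heavily overlapping boxes (hypothetical values); box 1 is another class.
boxes = [(0.0, 0.0, 10.0, 10.0), (0.0, 1.0, 10.0, 11.0), (0.0, 2.0, 10.0, 12.0)]
scores = [0.9, 0.8, 0.7]
classes = [0, 1, 0]

agnostic = greedy_nms(boxes, scores)
class_aware = per_class_nms(boxes, scores, classes)
print(agnostic, class_aware)
```

With these sample values the class-agnostic call keeps only box 0, while the per-class version also keeps box 1, because box 1 is never compared against boxes of class 0.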

So, you have both single-class NMS and multi-class NMS. Depending on your objectives, you can select either. (By the way, the original version of YOLO is written in C.)

Hope this helps.

@anon57530071 Yes, I think I got it now. Now it’s actually pretty obvious :slight_smile: It would have been nice if this had been mentioned in the programming exercise (since the lectures mention the need for per-class handling), then I wouldn’t have confused myself so much!

Thanks again!