Hi,
There is something that I don’t understand about the implementation of the yolo_non_max_suppression() function in course 4, week 3’s first programming assignment.
This function takes as input the output from yolo_filter_boxes(), i.e., three tensors that represent all boxes that remain after the box-score filtering that happens in yolo_filter_boxes():
- boxes: a tensor of shape (n, 4) containing the box coordinates
- scores: a tensor of shape (n,) containing the class probability scores
- classes: a tensor of shape (n,) containing the class indices
So here we have all the boxes that pass the filtering threshold, as well as their scores and the classes they belong to.
Now, here is what I don’t understand. If this information is passed to yolo_non_max_suppression(), how is the NMS algorithm going to distinguish between boxes belonging to different classes?
I understand how the elimination works if there is only one class to detect, but what if there are multiple, as is the case here? In the lecture videos, Andrew says that in the case of multiple classes, you have to carry out NMS independently, once for each class, which makes sense. So I assume one would run NMS as many times as there are classes, each time applying it to the boxes of one class only. We have the classes tensor, so we have the class information for the boxes.
However, I don’t see how tf.image.non_max_suppression() does this. It does not take any class information as input, and I don’t see how it would manage to ‘ignore’ boxes for different classes that have a high IoU.
An example to make it extra clear: suppose we have two boxes, both belonging to the same class, and with a very high IoU. In this case, the lower-score box is removed. But if these two boxes did not belong to the same class, they should both be kept (as I understand it, NMS should not even compare boxes of different classes). How is this achieved?
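To make the example concrete, here is a minimal pure-Python sketch of the two behaviours I am comparing. It is not the assignment's code and not TensorFlow's implementation: the helpers iou, nms and per_class_nms, the corner format (x1, y1, x2, y2), the class indices, and the 0.5 threshold are all my own assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS that ignores classes; returns kept indices, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap too much with a kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


def per_class_nms(boxes, scores, classes, iou_threshold=0.5):
    """Run the greedy NMS above separately on the boxes of each class."""
    keep = []
    for c in sorted(set(classes)):
        idx = [i for i, k in enumerate(classes) if k == c]
        kept = nms([boxes[i] for i in idx], [scores[i] for i in idx], iou_threshold)
        keep.extend(idx[k] for k in kept)  # map back to original indices
    return sorted(keep)


# Two heavily overlapping boxes (IoU = 81 / 119, about 0.68).
boxes = [(0.0, 0.0, 10.0, 10.0), (1.0, 1.0, 11.0, 11.0)]
scores = [0.9, 0.8]

agnostic = nms(boxes, scores)                      # classes never consulted
same_class = per_class_nms(boxes, scores, [3, 3])  # both boxes in class 3
diff_class = per_class_nms(boxes, scores, [3, 7])  # boxes in different classes
print(agnostic, same_class, diff_class)
```

In this sketch the class-agnostic version drops the lower-score box regardless of class, while the per-class version keeps both boxes when their classes differ. The latter is the behaviour I expected from the lecture.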
I hope I made myself clear - any insights are greatly appreciated!