Question about Non-max suppression algorithm

Dear friend and mentor,

I want to confirm my understanding of the non-max suppression algorithm. In the slides, the steps are:

Step 1. Discard all boxes with P_c ≤ 0.6

Step 2. While there are any remaining boxes:
Pick the box with the largest P_c.
Output that as a prediction.

Step 3. Discard any remaining box with IoU ≥ 0.5 with the box output in the previous step.

Here is a simple example: let's assume I have 5 boxes (5 P_c values): [0.2, 0.5, 0.6, 0.7, 0.9].
After step 1, the 0.2 and 0.5 boxes will be gone; the leftovers are 0.6, 0.7, 0.9.

Step 2: box 0.9 is the highest among the leftovers, so the 0.9 box will be picked as the ground-truth one (this is the "Output that as a prediction" part). Is this step correct?

Step 3: the 0.6 box will do IoU with 0.9, and the 0.7 box will do IoU with 0.9; whichever has IoU ≥ 0.5 will be deleted. (I am actually not following here.)

Q1: For the non-max suppression algorithm, you use small grid cells to detect the object (assume just one here). The first target is to find the best cell (only one will be selected); once this grid cell is found, you have the midpoint, and then you get the bounding box (using that midpoint and the box size b_w, b_h). Is this correct?

Q2: In step 2, does "Output that as a prediction" mean the ground truth box? In my example, is 0.9 considered the true one?

Q3: If my Q2 understanding is correct, why do I need to do IoU with 0.6 and 0.7? I could just pick 0.9, delete all the rest, and be done! Why do I need to delete the boxes with IoU ≥ 0.5 against the true one again? Yes, if a box has IoU ≥ 0.5 with the true one, I should ignore it, since it's a repetition. But we already knew the true one is 0.9, right?

Thank you!!

Hello!

Yes

Yes

Yes. If the IoU between box 0.6 and box 0.9 is ≥ 0.5, there is a chance that both boxes refer to the same object, so you want to keep only the box with the higher probability (box 0.9) and drop the other (box 0.6). If the IoU between box 0.7 and box 0.9 is < 0.5, there is a chance that these boxes refer to two different objects, so we want to keep box 0.7. After we have a list of boxes to keep, we pass it back to step 2 and then step 3, which ends with a shorter list, and then steps 2 and 3 again and again, until the list is empty.
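The loop over steps 2 and 3 can be sketched in plain Python. The function and variable names here are my own illustration, not the course code; boxes are assumed to be (x1, y1, x2, y2) corner coordinates:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, score_thresh=0.6, iou_thresh=0.5):
    """Greedy NMS following the three steps quoted from the slides."""
    # Step 1: discard all boxes with P_c <= score_thresh
    remaining = [(s, b) for s, b in zip(scores, boxes) if s > score_thresh]
    kept = []
    # Steps 2 and 3 repeat until no boxes remain
    while remaining:
        # Step 2: pick the box with the largest P_c and output it
        best = max(remaining, key=lambda sb: sb[0])
        kept.append(best)
        # Step 3: drop the picked box, and any box whose IoU with it
        # is at or above the threshold
        remaining = [sb for sb in remaining
                     if sb is not best and iou(sb[1], best[1]) < iou_thresh]
    return kept
```

Note that under the slide's "discard P_c ≤ 0.6" rule, a box scored exactly 0.6 is dropped in step 1; after that, the 0.9 box is output first, and whether the 0.7 box survives depends on its overlap with the 0.9 box.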

I think the non-max suppression algorithm isn't concerned with how you detect the object. The neural network (such as YOLO) is concerned with how to detect objects. The concern of non-max suppression is to drop redundant prediction boxes that refer to the same object, and keep those that are likely representing different objects. The idea is as simple as this: say my photo has only 2 objects, but YOLO gives me 10 boxes; we then hope non-max suppression filters 8 out and keeps the two that locate my objects well.

It means box 0.9 is the prediction, regardless of whether the prediction is correct or not. A NN can predict wrongly, right? If the prediction is wrong, then it is not the ground truth, right? So my answer is "not correct".

Because the non-max suppression algorithm doesn't assume there is only one object in a photo.

Cheers,
Raymond

Hi @rmwkwok Thanks for the help :slight_smile:

  1. You said "After we have a list of boxes to keep, we pass it back to step 2 and then step 3 …" — ha, this is the part I missed. I thought we only went through those steps once. Now I know there is a loop between steps 2 and 3, until the list is empty.

  2. Just to double-check what I meant by "in step 2, 'Output that as a prediction' means the ground truth box": 0.9 is just considered the true one (this true one always has the highest P_c in the list), and we use this true one to do IoU with the others. Am I correct?

One thought I would add is to be cautious in the use of ground truth. For each object in a training image, there is exactly one ground truth bounding box, and there is no ambiguity about where it is. This is your y data, which is known a priori, not predicted by the neural network. All of the bounding boxes being evaluated during non-max suppression are network outputs, or predictions: your \hat{y}. In YOLO there can potentially be many predicted bounding boxes per object, and NMS is used to prune that down to a single most likely output.

<soapbox> There are a few recent threads in this forum where people advocate running NMS separately per object class. This means you could still have multiple predictions per actual object after completion of the NMS step, with no obvious way to resolve them. Kinda defeats the purpose of running NMS. :thinking: </soapbox>
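To make that concern concrete, here is a toy, self-contained sketch (made-up boxes and a simplified greedy NMS of my own, not course or library code) where per-class NMS leaves two overlapping predictions on one object while class-agnostic NMS keeps only one:

```python
def iou(a, b):
    """Intersection over Union for boxes given as (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0


def nms(dets, iou_thresh=0.5):
    """Greedy NMS: keep highest-score boxes, drop heavy overlaps.

    dets is a list of (score, box, class_name) tuples.
    """
    dets = sorted(dets, key=lambda d: d[0], reverse=True)
    kept = []
    for d in dets:
        if all(iou(d[1], k[1]) < iou_thresh for k in kept):
            kept.append(d)
    return kept


# Two near-identical boxes on the SAME object, labelled with different classes
dets = [(0.9, (0, 0, 10, 10), "car"),
        (0.8, (1, 0, 11, 10), "truck")]

# Class-agnostic NMS: one surviving box for the object
print(len(nms(dets)))  # 1

# Per-class NMS: both survive, leaving two predictions for one object
per_class = [d for c in {"car", "truck"}
             for d in nms([d for d in dets if d[2] == c])]
print(len(per_class))  # 2
```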

Cheers

Oh, thanks sir. I am still watching the videos right now. I may come back when I do the coding.

But is my understanding above correct?

As @rmwkwok points out above, this is not precise. Though I would go further than he did and state that a prediction is not ground truth regardless of whether it is “right” or “wrong.” They are two different concepts.

Restated, ground truth is not a prediction. Conversely, a prediction is not ground truth. If the location and shape of a predicted bounding box is exactly congruent with a ground truth box, that is IOU == 1.0, then it is an accurate prediction. But it will never be ground truth, because that is part of the labelled training data, which a prediction never is.

In this course, NMS is a downstream post-processing step run on network output predictions, or candidates, to reduce the likelihood of duplicates being passed on to any decision support (e.g. turn left, apply brakes…).
