Hey guys! While doing the quiz, I found something that keeps confusing me. Assume that the input image contains only one car. If we have 2 anchor boxes, then each grid cell will have 2 predicted bounding boxes, right? So the grid cell that detects the car (because it contains the car's midpoint) has 2 bounding boxes. Say I remove all bounding boxes with a probability <= 0.6 and all boxes with IoU >= 0.5, and take the bounding box with the highest probability as the final prediction. What if the IoU is < 0.5? We do not remove that box, right? Then what is that bounding box used for? Thanks in advance!
Non-max suppression is an important technique for removing redundant bounding boxes, and IoU plays the key role in it, since we need to consider two cases:
- Multiple bounding boxes are covering a “single” object.
- There are multiple bounding boxes that overlap each other, but cover “different” objects.
I've tried to illustrate the above.
The second case is sometimes forgotten, but it is really important.
IoU is what lets us keep two bounding boxes independent. If the IoU is less than a certain threshold (like 0.5), it is better to treat the two bounding boxes as independent, i.e., as covering different objects (case 2).
Case 1 is pretty straightforward: if the IoU is larger than the threshold, just keep the bounding box with the highest probability.
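To make the two cases concrete, here is a minimal IoU sketch. It assumes boxes are given as `(x1, y1, x2, y2)` corner coordinates; the function name `iou` and the example boxes are hypothetical, not taken from the course code.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) corner format (assumed)."""
    # Corners of the intersection rectangle
    inter_x1 = max(box_a[0], box_b[0])
    inter_y1 = max(box_a[1], box_b[1])
    inter_x2 = min(box_a[2], box_b[2])
    inter_y2 = min(box_a[3], box_b[3])
    # Clamp at zero so that non-overlapping boxes give zero intersection
    inter_area = max(0.0, inter_x2 - inter_x1) * max(0.0, inter_y2 - inter_y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union_area = area_a + area_b - inter_area
    return inter_area / union_area if union_area > 0 else 0.0

# Case 1: two boxes that are almost the same -> high IoU, likely one object
print(round(iou((0, 0, 4, 4), (0.5, 0, 4.5, 4)), 2))  # 0.78
# Case 2: two boxes that overlap only a little -> low IoU, likely two objects
print(round(iou((0, 0, 4, 4), (3, 0, 7, 4)), 2))      # 0.14
```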
This process can become computationally expensive if the number of bounding boxes is huge. So the first thing to do is to remove low-probability bounding boxes. Then, start the non-max suppression process using IoU.
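The two steps above (score filtering first, then IoU-based suppression) can be sketched as follows. This is a minimal illustration, assuming the thresholds from the question (drop boxes with probability <= 0.6, suppress boxes with IoU >= 0.5) and `(x1, y1, x2, y2)` boxes; `non_max_suppression` is a hypothetical name, not the course's function.

```python
def iou(box_a, box_b):
    """IoU of two boxes in (x1, y1, x2, y2) corner format (assumed)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    union = ((box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
             + (box_b[2] - box_b[0]) * (box_b[3] - box_b[1]) - inter)
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, score_threshold=0.6, iou_threshold=0.5):
    """Return the indices of the kept boxes, highest score first."""
    # Step 1: drop low-probability boxes before the pairwise IoU step
    candidates = [i for i, s in enumerate(scores) if s > score_threshold]
    # Step 2: visit the remaining boxes from highest to lowest score
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    while candidates:
        best = candidates.pop(0)
        kept.append(best)
        # Case 1: IoU >= threshold with the kept box -> likely a duplicate, remove it.
        # Case 2: IoU < threshold -> likely a different object, keep it as a candidate.
        candidates = [i for i in candidates
                      if iou(boxes[best], boxes[i]) < iou_threshold]
    return kept

boxes = [(0, 0, 4, 4), (0.5, 0, 4.5, 4), (10, 10, 14, 14), (3, 0, 7, 4)]
scores = [0.9, 0.8, 0.7, 0.5]
print(non_max_suppression(boxes, scores))  # [0, 2]
```

Box 1 is removed as a near-duplicate of box 0, box 2 survives because it does not overlap box 0 at all (case 2), and box 3 never reaches the IoU step because its probability is below 0.6. This also answers the original question: a box with IoU < 0.5 is kept because it is assumed to be a different object.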
Hope this clarifies.
This really helps me! Thank you!
Oh, and one more thing, @anon57530071. What if there are cars that almost overlap each other? Do we still remove the bounding box whose IoU is greater than the threshold? Another question: is IoU computed between a ground truth and a predicted bounding box, or between two predicted bounding boxes?
If you use non-max suppression, then one of those bounding boxes will be removed.
There is another technique, called “Soft Non-max suppression”. Here is a link to a paper. (The following picture is also from that paper.)
Soft-NMS keeps both bounding boxes, but it changes their probabilities: the larger the IoU, the more the probability of the less probable bounding box is reduced. With this, the highest-probability box is kept just as in NMS, but the second candidate is also kept with a lower probability. (This reflects that the lower one most likely covers the same object as the higher one, but there is a possibility that it is a different one…)
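Here is a minimal sketch of the rescoring idea, assuming the linear variant, in which a box's score is multiplied by `(1 - IoU)` when its IoU with the selected box reaches a threshold (the paper also describes a Gaussian variant). The function name `soft_nms_linear`, the 0.5 threshold and the example boxes are assumptions for illustration only.

```python
def iou(box_a, box_b):
    """IoU of two boxes in (x1, y1, x2, y2) corner format (assumed)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    union = ((box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
             + (box_b[2] - box_b[0]) * (box_b[3] - box_b[1]) - inter)
    return inter / union if union > 0 else 0.0

def soft_nms_linear(boxes, scores, iou_threshold=0.5):
    """Return (index, rescored probability) pairs in the order boxes are selected."""
    remaining = dict(enumerate(scores))
    result = []
    while remaining:
        # Select the box with the highest (possibly already reduced) score
        best = max(remaining, key=remaining.get)
        result.append((best, remaining.pop(best)))
        for i in list(remaining):
            overlap = iou(boxes[best], boxes[i])
            if overlap >= iou_threshold:
                # Larger overlap -> larger reduction; the box is down-weighted, not deleted
                remaining[i] *= 1.0 - overlap
    return result

boxes = [(0, 0, 4, 4), (0.5, 0, 4.5, 4), (10, 10, 14, 14)]
scores = [0.9, 0.8, 0.7]
print([(i, round(s, 3)) for i, s in soft_nms_linear(boxes, scores)])
# [(0, 0.9), (2, 0.7), (1, 0.178)]
```

Box 1 overlaps box 0 heavily, so instead of disappearing it survives with a much lower score. A final score threshold (not shown) would then decide whether it is reported.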
There is also interesting research from a different viewpoint: using Transformers, which you will learn about in Course 5. This is a totally different approach. See this paper, End-to-End Object Detection with Transformers.
I haven't digested this paper yet, but I expect it to work better than existing techniques.
Region shapes are compared at three different points in YOLO, and each comparison is different.
While establishing the training data, IoU is used to compare the ground-truth shape with the anchor box shapes, in order to assign the ground truth to the correct cell of the network output (s_i, s_j, best_anchor,…).
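A minimal sketch of that assignment step, assuming boxes are `(x, y, w, h)` with the center and size normalized to [0, 1], anchors are `(w, h)` pairs in the same units, and the shapes are compared as if they shared the same center. The names `shape_iou` and `assign_ground_truth` and the example anchors are hypothetical; the returned `(row, col, best_anchor)` only loosely mirrors the `(s_i, s_j, best_anchor,…)` above.

```python
import math

def shape_iou(wh_a, wh_b):
    """IoU of two boxes compared by shape only, as if they shared the same center."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union

def assign_ground_truth(box, anchors, grid_size):
    """Map a ground-truth box (x, y, w, h) to (row, col, best_anchor)."""
    x, y, w, h = box
    # The grid cell containing the box midpoint is responsible for the object
    row = min(math.floor(y * grid_size), grid_size - 1)
    col = min(math.floor(x * grid_size), grid_size - 1)
    # Among that cell's anchors, pick the one whose shape best matches the box
    ious = [shape_iou((w, h), anchor) for anchor in anchors]
    best_anchor = max(range(len(anchors)), key=lambda k: ious[k])
    return row, col, best_anchor

anchors = [(0.1, 0.3), (0.4, 0.2)]  # a tall anchor and a wide anchor (made-up values)
car = (0.52, 0.31, 0.35, 0.18)      # a wide ground-truth box
print(assign_ground_truth(car, anchors, grid_size=19))  # (5, 9, 1)
```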
During training itself, the loss function compares the predicted bounding box shape to the ground-truth shape, but it doesn't use IoU; it uses least squares (at least in the YOLO v2 code used in this course).
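To contrast this with IoU, here is a deliberately simplified squared-error term on `(x, y, w, h)`. The name `squared_error_box_loss` is hypothetical, and this is not the course's actual loss function, which also applies masks, scale factors and other loss terms.

```python
def squared_error_box_loss(pred, truth):
    """Sum of squared differences between predicted and ground-truth (x, y, w, h)."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth))

truth = (0.5, 0.5, 0.2, 0.2)
near = (0.55, 0.5, 0.2, 0.2)  # slightly shifted prediction (still overlaps the truth)
far = (0.9, 0.5, 0.2, 0.2)    # prediction that no longer overlaps the truth
print(round(squared_error_box_loss(near, truth), 4))  # 0.0025
print(round(squared_error_box_loss(far, truth), 4))   # 0.16
```

One property worth noticing: the squared error keeps growing as the prediction moves farther away, even after the boxes stop overlapping, whereas IoU is simply 0 for every non-overlapping prediction.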
During NMS, you are pruning the list of predicted bounding boxes, so you iteratively compare one predicted bounding box (the one with the next-highest confidence) to the other predicted bounding boxes to remove likely duplicates. The assumption is that if two predicted bounding boxes are exactly the same (IoU = 1) or almost exactly the same (IoU > threshold), then they likely enclose the same object, so only the prediction with the highest confidence is retained. This is Case 1 above. Nearby objects often have different shapes; then, even if their predicted bounding boxes partially overlap, IoU can differentiate them. This is Case 2 above. However, note that when multiple different objects with the same shape overlap, YOLO with vanilla NMS might treat them as Case 1 when they are really Case 2. Poor performance on tight clusters of similar objects is a known deficiency of early versions of YOLO, mentioned by its creators.
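A small numeric illustration of why shape matters here, assuming `(x1, y1, x2, y2)` boxes and a 0.5 NMS threshold; the coordinates are made up. A tall pedestrian box and a wide car box that overlap still have a low IoU, while two same-shaped cars side by side cross the threshold.

```python
def iou(box_a, box_b):
    """IoU of two boxes in (x1, y1, x2, y2) corner format (assumed)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    union = ((box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
             + (box_b[2] - box_b[0]) * (box_b[3] - box_b[1]) - inter)
    return inter / union if union > 0 else 0.0

pedestrian = (2, 0, 4, 8)  # tall and narrow
car_a = (0, 4, 8, 8)       # wide and short
car_b = (2, 4, 10, 8)      # same shape as car_a, shifted slightly

# Different shapes: partial overlap but low IoU, so NMS keeps both (Case 2)
print(round(iou(pedestrian, car_a), 2))  # 0.2
# Same shape: IoU above 0.5, so NMS would drop one car as a "duplicate" (Case 1)
print(round(iou(car_a, car_b), 2))       # 0.6
```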
Thanks @ai_curious and @anon57530071 for answering my question!