More than you ever wanted to know about YOLO and NMS.
NMS prunes duplicates
Non-max suppression (NMS) is used in YOLO to suppress predictions that are likely to be of the same object. It ignores the class predictions and relies on the Jaccard similarity coefficient, aka IOU (intersection over union), computed using only the two predicted center locations and shapes after they are transformed into bounding boxes.
If the IOU between two predicted bounding boxes is sufficiently high, they are treated as the same object and only one of the two, the one with the higher confidence, is retained.
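To make the rule above concrete, here is a minimal sketch of class-agnostic greedy NMS in plain Python. It is an illustration only, not the code any particular YOLO release uses. The function names `iou` and `nms`, the corner format `(x1, y1, x2, y2)`, and the default threshold of 0.5 are all assumptions made for this example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlap rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS that ignores class; returns kept indices, best score first.

    iou_threshold is a hypothetical default chosen for this sketch.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap too much with any
        # higher-confidence box that was already kept.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

In this toy input the first two boxes overlap with an IOU of about 0.68, so the lower-confidence one is dropped, while the third box is disjoint from both and survives.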
Here’s why…
Why duplicate predictions might occur
Multiple predictions of the same object can come either from multiple anchor boxes in the same grid cell or from predictions in neighboring grid cells that each place the object center within their own cell (NOTE: one of these is incorrect, since the actual center can only be in one grid cell at a time).
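The neighboring-cell case can be illustrated with a small sketch. It assumes a simplified decoding in which each cell predicts a center offset in [0, 1) relative to its own top-left corner, plus a width and height as fractions of the image. The `decode` helper, the 19x19 grid, and the numbers are all hypothetical choices for this example, not taken from a specific YOLO version.

```python
def decode(cell_col, cell_row, off_x, off_y, w, h, grid=19):
    """Turn a cell-relative center offset into image-relative corners.

    Returns (x1, y1, x2, y2) with all values as fractions of the image size.
    """
    cx = (cell_col + off_x) / grid
    cy = (cell_row + off_y) / grid
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


# Cell (9, 9) puts the center near its right edge; its neighbor (10, 9)
# puts the center near its own left edge. Both describe the same object.
box_a = decode(9, 9, 0.95, 0.5, 0.2, 0.3)
box_b = decode(10, 9, 0.05, 0.5, 0.2, 0.3)

# Largest difference between corresponding corners of the two boxes.
max_corner_gap = max(abs(p - q) for p, q in zip(box_a, box_b))
print(box_a, box_b, max_corner_gap)
```

The two decoded boxes differ by well under one percent of the image width, which is exactly the kind of near-duplicate pair that NMS is meant to collapse into one.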
Why use IOU alone for duplicate detection?
First, consider the case of low IOU. This means either the locations are different, or the shapes are different, or both. If the locations and shapes are disjoint, the IOU is 0 and they must be different objects, regardless of class, so keep them both. If the locations are similar but the shapes are sufficiently different that the IOU is low, they must again be different objects, regardless of class, so keep them both. This is what enables YOLO to detect a Person standing in front of a Car, for example: same center location prediction, different shapes.
Only if the location and shape are both sufficiently similar that the IOU is high are the two predictions assumed to be duplicates. In that case keep only the one with the higher confidence, even if the predicted classes of the two objects are different.
In the limit that the IOU is 1, the location in the image is identical and the shape is also identical; the prediction is that they share the same pixels. In this case they are in effect superimposed on each other, so one is occluded; even if the class predictions are different, keep only the one with the higher confidence. The simplifying assumption is that if two bounding boxes contain the same (or almost the same) pixels, they must enclose the same object.
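A quick numeric check of the Person-in-front-of-Car case may help. The sketch below uses made-up pixel coordinates and two hypothetical helpers, `center_to_corners` and `iou`; the boxes share a center but have very different shapes, and a second pair shows a near-duplicate for contrast.

```python
def center_to_corners(cx, cy, w, h):
    """Convert a (center, width, height) box to (x1, y1, x2, y2) corners."""
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Same center (50, 50): a tall, narrow person and a wide, short car.
person = center_to_corners(50, 50, 20, 60)
car = center_to_corners(50, 50, 100, 40)
same_center_iou = iou(person, car)

# The person box again, shifted by one pixel: a likely duplicate.
shifted_person = center_to_corners(51, 51, 20, 60)
near_duplicate_iou = iou(person, shifted_person)

print(same_center_iou, near_duplicate_iou)
```

With these numbers the person and car overlap with an IOU of 2/11, roughly 0.18, so both would survive a typical threshold, while the shifted copy of the person overlaps at roughly 0.88 and would be suppressed.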
Doesn’t this make some mistakes?
Yes, but YOLO is optimized for frame-rate throughput. A small degradation in accuracy is acceptable in the name of speed. It's an engineering tradeoff, where this approach was deemed to have more benefit (pruning of true duplicates) than cost (false positive duplicate designations). All the thresholds are parameters that can be tuned empirically based on a confusion matrix. HTH