Non-max suppression C4W3 assignment (car detection with YOLO)

The short answer is ‘Yes, that could happen.’ However, this is where anchor boxes come into play. Anchor boxes are chosen based on common shapes in the training data. If you’re not familiar with how that is done, see the link below. Given their respective shapes, it is very likely that dogs and people are assigned to different anchor box shapes during training. It is therefore also likely that a dog and a person whose centers fall at the same location in the image are each predicted separately at run time. Further, if the bounding box predictions are at all accurate, the two boxes have different shapes and therefore a low IOU with each other. As a result, both would survive NMS. If, on the other hand, a dog is sitting on the lap of a seated person, and the two bounding boxes have almost the same location and shape, then only the one with the highest confidence score would be kept.
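To make the IOU argument concrete, here is a minimal pure-Python sketch of greedy NMS. It is not the assignment's implementation: the helper names (`iou`, `nms`, `centered_box`), the box coordinates, the scores, and the 0.5 threshold are all made up for illustration, and the suppression is class-agnostic, matching the description above.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy, class-agnostic NMS. Returns indices of the boxes that survive."""
    # Visit boxes from highest to lowest confidence score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


def centered_box(cx, cy, w, h):
    """Hypothetical helper: build (x1, y1, x2, y2) from center, width, height."""
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


# Same center, different shapes: a tall person box and a wide dog box.
person = centered_box(50, 50, 20, 60)
dog = centered_box(50, 50, 50, 20)
survivors_different_shapes = nms([person, dog], [0.9, 0.8])

# Dog on a lap: nearly the same location and shape.
seated_person = centered_box(50, 50, 40, 40)
lap_dog = centered_box(51, 51, 38, 40)
survivors_similar_shapes = nms([seated_person, lap_dog], [0.9, 0.8])
```

With these invented numbers, the tall and wide boxes overlap with an IOU of roughly 0.22, so both indices survive; the lap case has an IOU of roughly 0.9, so only the higher-scoring box is kept.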

The intent of grid cells is to allow detection of multiple objects in an image without running forward propagation more than once. The intent of anchor boxes is to allow detection of multiple objects at the same location in an image, again without running forward propagation more than once. This works well when the objects are of different classes with distinct shapes, such as dog and person, or person and car. But the model breaks down with multiple objects of the same size at the same location. In that case, only the ‘best’ one will survive the pruning. Hope this helps.
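The same-shape limitation can be illustrated with a small sketch of how ground-truth objects might be matched to anchors by shape. Everything here is an assumption for illustration: the two anchor shapes, the object sizes, the `assign` helper, and the rule that a second object landing in an occupied (cell, anchor) slot is simply dropped. Real label encoders may handle collisions differently.

```python
# Hypothetical anchors as (width, height) fractions: one tall, one wide.
ANCHORS = [(0.3, 0.8), (0.8, 0.4)]


def shape_iou(wh_a, wh_b):
    """IOU of two boxes that share a center, so only width and height matter."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union


def best_anchor(wh, anchors=ANCHORS):
    """Index of the anchor whose shape best matches the object's shape."""
    return max(range(len(anchors)), key=lambda i: shape_iou(wh, anchors[i]))


def assign(objects, anchors=ANCHORS):
    """Assign (name, grid_cell, (w, h)) objects to (cell, anchor) slots.

    Assumption: each slot holds one object, and later arrivals are dropped.
    """
    slots, dropped = {}, []
    for name, cell, wh in objects:
        key = (cell, best_anchor(wh, anchors))
        if key in slots:
            dropped.append(name)
        else:
            slots[key] = name
    return slots, dropped


# A person and a dog centered in the same grid cell get different anchors.
mixed_slots, mixed_dropped = assign([
    ("person", (3, 4), (0.25, 0.7)),
    ("dog", (3, 4), (0.7, 0.35)),
])

# Two similarly shaped people in the same cell compete for the same anchor.
crowd_slots, crowd_dropped = assign([
    ("person_a", (3, 4), (0.25, 0.7)),
    ("person_b", (3, 4), (0.28, 0.75)),
])
```

In the first case each object gets its own slot; in the second, both people match the tall anchor in the same cell, so under this sketch's rule only one of them can be represented.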

Related thread: