In the YOLO algorithm, each grid cell is responsible for predicting objects whose midpoint falls into it, and the anchor box with the highest IoU with the ground truth box is responsible for detecting the object. In that case, why can duplicate predictions of the same object still happen (as shown in the following picture)?
I understand that a single grid cell can predict multiple boxes, one per anchor box, for different (overlapping) objects, and that after non-max suppression both may be kept (as long as they have different anchor box shapes → lower IoU between them), like in the picture below:
Hi Zihan!
Duplicate predictions of the same object can occur due to a combination of factors.
The anchor boxes used in YOLO have different shapes and aspect ratios. In scenarios where multiple anchor boxes have high IoU (Intersection over Union) with the ground truth box, the algorithm may produce duplicate predictions for the same object. This is because each anchor box is responsible for predicting objects of certain sizes and aspect ratios, and it’s possible for multiple anchor boxes to closely match the ground truth object.
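To make that IoU matching concrete, here is a tiny sketch I put together for this thread (not code from the course or the YOLO papers; the box coordinates are made up). Two anchors of similar shape can both overlap a single ground truth box well:

```python
# Minimal IoU between two boxes given as (x1, y1, x2, y2).
def iou(box_a, box_b):
    # Intersection rectangle (empty intersections clamp to zero area)
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

gt = (10, 10, 50, 90)            # a tall object
print(iou(gt, (12, 8, 52, 88)))  # ~0.86, high IoU
print(iou(gt, (8, 15, 55, 85)))  # ~0.86, also high IoU
```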
YOLO assigns a confidence score to each bounding box prediction to indicate how likely the box is to contain an object. During the prediction phase, the algorithm keeps every bounding box whose confidence score is above a chosen threshold (the one you set yourself, e.g. 0.5) as a potential detection. If multiple overlapping bounding boxes all clear that threshold, they may all be treated as separate predictions, resulting in duplicates.
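As a toy illustration of that thresholding (all the numbers below are invented), notice how two overlapping boxes can both survive the cutoff:

```python
# Hypothetical per-box predictions: (confidence, (x1, y1, x2, y2)).
predictions = [
    (0.92, (10, 10, 50, 90)),    # one detector fires on the object
    (0.61, (12, 14, 53, 88)),    # a second detector fires on the same object
    (0.30, (200, 40, 260, 80)),  # a low-confidence box elsewhere
]

threshold = 0.5  # the cutoff you choose, e.g. 0.5
kept = [p for p in predictions if p[0] >= threshold]
# The first two boxes both survive -> duplicate detections of one object,
# which is what non-max suppression then has to clean up.
print(kept)
```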
Hope you get the point; if not, feel free to post your queries.
Regards,
Nithin
I am not convinced that this quite explains the OP's question.
For each ground truth bounding box there is only one grid cell that contains its center location, and only one anchor box with the highest IoU. Therefore, only one grid cell + anchor box location will be assigned data for that ground truth bounding box in the training input Y. There is no ambiguity between grid cells, anchor boxes, and ground truth.
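A minimal sketch of that assignment (the array layout and anchor sizes here are invented for illustration, not the assignment's actual encoding) shows why each ground truth box fills exactly one slot in Y:

```python
import numpy as np

S = 3                                    # S x S grid
anchors = [(0.20, 0.20), (0.10, 0.40)]   # (w, h), image-relative; made-up values
B, D = len(anchors), 5                   # D = [objectness, x, y, w, h]; classes omitted
Y = np.zeros((S, S, B, D))

def shape_iou(a, b):
    # IoU of two boxes compared by shape alone (centers aligned)
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

# One ground-truth box: center (cx, cy) and size (w, h), all in [0, 1].
cx, cy, w, h = 0.52, 0.48, 0.12, 0.38
col, row = int(cx * S), int(cy * S)                                # exactly one cell...
best = max(range(B), key=lambda i: shape_iou((w, h), anchors[i]))  # ...and one anchor
Y[row, col, best] = [1.0, cx, cy, w, h]                            # a single non-zero slot
```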
And if the predicted output were 100% exactly correct, there would never be any ambiguity in the predicted outputs, either. That is, \hat{Y} would exactly match Y in terms of which locations had non-zero predictions. However, the network output is not 100% accurate, and sometimes detectors will output predictions that they shouldn't. I believe this is most likely to happen when the center of the object is near a grid cell border and features of the object fall in more than one grid cell, or when the predicted object size doesn't match any one anchor box well. In these cases one or more grid cell + anchor box detectors may fire erroneously. This is the condition that non-max suppression is used to correct for: two predicted bounding boxes with almost the same location and shape are considered to be a mistake.
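Here is a bare-bones sketch of that suppression step (my own simplification for this thread, not the course implementation):

```python
# Each prediction is (score, (x1, y1, x2, y2)).
def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def nms(preds, iou_thresh=0.6):
    preds = sorted(preds, key=lambda p: p[0], reverse=True)
    kept = []
    for score, box in preds:
        # Two boxes with nearly the same location and shape are treated as
        # one mistaken duplicate: keep only the higher-scoring one.
        if all(iou(box, k) < iou_thresh for _, k in kept):
            kept.append((score, box))
    return kept

# Two near-identical predictions of one object collapse to a single detection:
print(nms([(0.92, (10, 10, 50, 90)), (0.61, (12, 14, 53, 88))]))
```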
I also think a well-chosen set of anchor boxes is unlikely to contain two shapes so similar that they would cause duplicates, though I haven't done experiments to prove this…just my intuition from the experiments on deriving anchor boxes that I did for other purposes. Hope this helps.
In my opinion, it highly depends on the dataset we have. Suppose we have a dataset of images of someone's face taken from a webcam, where he had just taken pictures of his face tilted at various angles but still centred in front of the camera. Then there might be multiple bounding boxes close to the actual one, and if we follow the method used in YOLO9000 (k-means clustering on the training-set bounding boxes to estimate the anchor box dimensions rather than handpicking them), the result also becomes dependent on k. I have observed this in one of my experiments, but it's not concrete enough to prove anything -- just an observation. For example, consider this case (ground truth bounding boxes):
I might be wrong! Still trying to understand YOLO to the fullest, so please do share your thoughts on this; it would be really helpful for me.
Looks like k=3 is a good start for that data set: one small and square; one larger and rectangular, taller than wide, where the majority of the ground truth boxes seem to fall; and one larger still. Picking k is an engineering tradeoff, right? Larger k means more coherence between ground truth and anchor boxes, but at the cost of a larger memory footprint and more computation. At some point the diminishing marginal increase in coherence is outweighed by the additional cost. The optimal configuration always depends on the application context.
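For anyone who wants to play with that tradeoff, here is a rough sketch of the YOLO9000-style derivation: k-means over the training-set box shapes with distance 1 − IoU. The box data below is synthetic, mimicking the three groups just described; with real labels you would collect all ground-truth (w, h) pairs instead:

```python
import numpy as np

def shape_iou(boxes, centroids):
    # Pairwise IoU of (w, h) shapes with centers aligned: an (N, k) matrix.
    inter = np.minimum(boxes[:, None, 0], centroids[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    union = boxes.prod(1)[:, None] + centroids.prod(1)[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = shape_iou(boxes, centroids).argmax(1)  # nearest = min(1 - IoU)
        centroids = np.array([boxes[assign == i].mean(0) if (assign == i).any()
                              else centroids[i] for i in range(k)])
    return centroids

# Synthetic (w, h) data mimicking the three groups described above.
rng = np.random.default_rng(1)
boxes = np.abs(np.concatenate([
    rng.normal([0.10, 0.10], 0.02, (100, 2)),  # small and square
    rng.normal([0.15, 0.45], 0.03, (200, 2)),  # taller than wide
    rng.normal([0.50, 0.55], 0.05, (60, 2)),   # larger still
]))
print(kmeans_anchors(boxes, k=3))
```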
Since YOLO is optimized for throughput, it's not clear to me that having a bunch of nearly identical anchor boxes producing a lot of duplicate predictions (false positives) that have to be disambiguated downstream is a good choice. Ultimately the proof is in the runtime performance metrics…use what works best.