When I make a prediction on an image and run it many times, instead of getting one bounding box for each object I get overlapping boxes. How can I get a single box for each object even if I run it twice or three times?
This was discussed in the lectures and in the assignment under the heading of “non-max suppression”. That was section 2.4 and exercise 3 of the assignment.
That section covers how to remove the extra bounding boxes. I did all those steps and then made the prediction. When I ran the prediction once, the image showed a single, correct bounding box for each object. But when I rerun that line, the previous boxes are not eliminated and the new ones overlap them. How can I avoid this, so that each time a prediction is run on an image it gives just one bounding box per object?
I don’t know the answer, but if I correctly understand what you are saying, you get different results the first time you predict versus the second and later times. Well, just as a scientist, one would observe that if you try what you think is the same thing and get different results, it must not really have been the same thing, right? You could theorize that the whole process is non-deterministic, but the Occam’s Razor version would be that there is some aspect of this that is “stateful”, or perhaps you are not actually executing the same thing. We can’t see what you are doing, so you are the one in the best position to investigate further. Try to construct a more “pure” experiment. E.g. try “Kernel → Restart and Clear Output” and then do “Cell → Run All”. When you start from a clean state, does that change the behavior, e.g. make it more reproducible?
Or maybe we get lucky and someone who knows more than I do about YOLO and TF in general will be able to suggest a better theory …
If only there were a way to take a list of bounding boxes and confidence scores and suppress the likely duplicate boxes, say all the ones with the non-maximum confidence scores. That would be pretty useful.
@paulinpaloalto had it right, as usual. It is difficult to be precise when describing what “YOLO” does, since there are multiple versions and multiple code implementations of each version that have appeared over the years, and the OP doesn’t elaborate on which was used. That said, one fundamental idea is common across all versions: each grid cell + anchor box pair acts as an independent detector and makes its own prediction about object presence, location, shape, and class. Therefore it is not at all unexpected that multiple predictions of the same object are produced by different nearby detectors on each forward propagation. And given that each detector has its own learned parameters, it is also not unexpected that they produce slightly different predicted values. To prune the list of predictions down to only the best, a post-neural-net processing step must occur. Non-maximum suppression is one such step. Here is the official TensorFlow version of it:
https://www.tensorflow.org/api_docs/python/tf/image/non_max_suppression
You pass in lists of box coordinates and confidence scores, along with IOU and confidence score thresholds and a maximum count, and get back a list of indices into the input boxes. The surviving boxes are those with a confidence score higher than the threshold. Additionally, if a set of boxes is determined to be effectively co-located because their pairwise IOU exceeds the provided IOU threshold, then only the member with the highest confidence score is retained.
The result of the pruning is a subset of the original list where each box represents the highest-confidence prediction for that location. If you really want only one prediction per image, run NMS with max_output_size = 1.
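A minimal sketch of that call, using made-up boxes and scores purely for illustration (the box coordinates, scores, and thresholds here are assumptions, not values from any assignment):

```python
import tensorflow as tf

# Three hypothetical candidate boxes in [y1, x1, y2, x2] format,
# which is the format tf.image.non_max_suppression expects.
boxes = tf.constant([
    [0.0,  0.0,  1.0, 1.0],   # detector A's box for an object
    [0.05, 0.05, 1.0, 1.0],   # detector B's near-duplicate of the same object
    [2.0,  2.0,  3.0, 3.0],   # a different object elsewhere in the image
])
scores = tf.constant([0.9, 0.75, 0.8])

# Keep at most 10 boxes; among boxes whose IOU exceeds 0.5, retain only
# the highest-scoring one; drop anything scoring below 0.6.
selected = tf.image.non_max_suppression(
    boxes, scores,
    max_output_size=10,
    iou_threshold=0.5,
    score_threshold=0.6,
)

# The return value is indices into `boxes`; gather to get the boxes themselves.
kept_boxes = tf.gather(boxes, selected)
print(selected.numpy())  # → [0 2]: the near-duplicate (index 1) was suppressed
```

The two overlapping boxes have IOU ≈ 0.9, well above the 0.5 threshold, so only the higher-scoring one survives, while the distant box is untouched. That is exactly the pruning behavior described above.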
<rant>
It is important to recognize that YOLO is not merely the CNN. It is critical to pick a useful number of anchor boxes with shapes derived from the operational data. It is critical to perform post-processing. It is critical to understand which version of YOLO is being used and how it was trained. You shouldn’t just download some code and/or prebuilt models from GitHub and expect good results on all inputs.
</rant>