Is there a reason why the non-max suppression is only applied in the prediction phase and not in the learning phase ?
Thanks for your help
Non-max suppression is used at prediction-time to prune predictions that are likely duplicates. That is, more than one location in the CNN output is making a prediction on the same input object.
At training time there is no uncertainty about the input object center or shape, and thus no uncertainty about which CNN output location (grid cell and anchor box) is responsible for it. The output ‘pruning’, if you want to think of it that way, is performed by the cost function, with optimization iteratively driving out prediction imprecision.
So during the learning phase, only the bounding box with the highest IOU with the ground truth is kept, and the rest of the (wrong) predictions are discarded?
Also, when non-max suppression is applied at prediction time, does only the bounding box with the highest confidence score remain out of all the bounding boxes for a particular object? Non-max suppression seems confusing to me: why does it iterate and remove boxes based on a particular threshold, rather than simply taking the bounding box with the max score and discarding all the other bounding boxes?
The algorithm and the cost function changed a little across the versions of YOLO. The code in this week’s notebook is based on what the authors called YOLO9000, but it is often referred to as v2. The paper you linked in a separate thread is the original, v1. I haven’t looked at v1 code for a long time, but I know the ‘highest IOU with ground truth is taken’ part of your question is not how v2 works. Bounding box localization is learned through application of the cost function, which drives iterative refinement of the predicted bounding box parameters bx, by, bw, and bh.
NMS in v2 has a nuance that is easy to miss. After forward propagation it is possible to have predictions made by more than one grid cell/anchor box that are actually on the same image object. Here you use IOU not with ground truth (which you don’t have at prediction time) but between two predictions. If the IOU is high, they are likely the same object, and the one with the lower confidence is pruned from the list. So you start with the highest-confidence box, iterate to find and remove likely duplicates, then move to the next highest remaining box and look for its duplicates, etc. Note that you are actually pruning predictions with high IOU, since those are likely to be the duplicates. Predictions with low IOU are likely to be a different object nearby or overlapping in the image, like a person standing in front of a car, and will not be pruned by NMS. This is counterintuitive.
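The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the notebook’s actual implementation: boxes are assumed to be (x1, y1, x2, y2) corners, and the function names and the 0.5 IOU threshold are my own choices.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of boxes kept after non-max suppression."""
    # Visit boxes in descending confidence order.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)   # highest-confidence box remaining
        keep.append(best)
        # Prune remaining boxes with HIGH IOU vs. the kept box: they are
        # likely duplicate predictions of the same object. Low-IOU boxes
        # survive, since they are probably different nearby objects.
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```

For example, two heavily overlapping boxes collapse to the more confident one, while a distant box is untouched, so `nms([(0, 0, 10, 10), (1, 1, 10, 10), (50, 50, 60, 60)], [0.9, 0.8, 0.7])` keeps the first and third boxes. Note the threshold works the opposite way from a score cutoff: a high IOU triggers removal.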