I tried reading the YOLOv1 paper and am unable to understand this:
It weights localization error
equally with classification error which may not be ideal.
Also, in every image many grid cells do not contain any
object. This pushes the “confidence” scores of those cells
towards zero, often overpowering the gradient from cells
that do contain objects. This can lead to model instability,
causing training to diverge early on
The above is from page 3 of the paper, under the Training section. Link to the paper:
Regarding the first part of your question, you can have a look here. The second part of your question is addressed here.
Also, does YOLO give multiple bounding boxes during training only? Is it the case that during prediction/testing, YOLO outputs only one bounding box?
The paper states "On PASCAL VOC the network predicts 98 bounding boxes per image and class probabilities for each box."
The output of a CNN forward pass is the same regardless of whether you are training or doing what you call prediction/testing. What differs is whether there is backprop and iterative modification (learning) of the parameters; forward propagation produces an output of the same dimensions in all cases. In YOLO that dimension, and thus the number of bounding-box predictions, is driven by the SxSxB shape of the output layer (7x7x2 = 98 on PASCAL VOC). You train it to make that many predictions at a time, and when you run it operationally, that's what it does. This is generally true of all machine learning: you train how you fight, and fight how you train.
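To make the fixed output size concrete, here is a minimal NumPy sketch using the paper's PASCAL VOC settings (S=7, B=2, C=20); the variable names are my own, and the tensor is just zeros standing in for real network activations:

```python
import numpy as np

# YOLOv1 settings from the paper for PASCAL VOC:
# S = 7 grid cells per side, B = 2 boxes per cell, C = 20 classes.
S, B, C = 7, 2, 20

# The final layer always emits an S x S x (B*5 + C) tensor,
# in training and in inference alike; only backprop differs.
output = np.zeros((S, S, B * 5 + C))

print(output.shape)  # (7, 7, 30)
print(S * S * B)     # 98 bounding boxes per image
```

The 5 per box covers (x, y, w, h, confidence); the 98 boxes fall directly out of the output shape, which is why the network predicts the same number of boxes whether the image contains 5 objects or 50.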
Sliding windows, other region based approaches, and YOLO were all invented to deal with the challenge of detecting multiple objects per scene. YOLO did it close to as well and much much faster, which is why you are studying it 5 years on.
So after these 98 bounding boxes are predicted, and supposing that the image has only 5 objects (and thus we should get only 5 bounding boxes), does this algorithm, for each grid cell, select the bounding box having the highest probability among all the bounding boxes for that particular grid cell?
Yes, that is demonstrated in figure 2 in the paper.
That also means that Intersection over Union is not used at prediction time, as there is no ground truth available?
No, because intersection over union is used in non-max suppression: overlapping predictions are compared by IoU, and only the highest-scoring box among them is kept. This is explained in the assignment. On this, the YOLO paper states: "some large objects or objects near the border of multiple cells can be well localized by multiple cells. Non-maximal suppression can be used to fix these multiple detections." (p. 4).
Remember that IoU, or the Jaccard index, is a general-purpose mechanism for comparing two regions. It is used differently in different parts of the overall YOLO solution. Initially it is used to compare anchor boxes with ground-truth bounding boxes during training-data setup. Later it is used to compare two outputs of forward propagation (neither of which is a ground-truth bounding box) during the NMS phase of operational execution.
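The two pieces above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the `iou` and `nms` function names, the `[x1, y1, x2, y2]` box format, and the 0.5 threshold are my own choices for the example.

```python
import numpy as np

def iou(a, b):
    """Intersection over union (Jaccard index) of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-max suppression: keep the highest-scoring box, discard
    remaining boxes whose IoU with it exceeds the threshold, repeat."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(best)
        order = [i for i in order[1:]
                 if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```

For example, two heavily overlapping detections of the same object plus one detection elsewhere collapse to two surviving boxes: `nms([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], [0.9, 0.8, 0.7])` keeps indices 0 and 2. Note that neither argument is a ground-truth box; at prediction time IoU only compares the network's own outputs with each other.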