Hello, while studying the YOLO algorithm, I have a question about it.
Based on what I understand, the YOLO algorithm proceeds in 3 steps as below:
1. S x S grid on input
2-1. Draw B bounding boxes for each cell and calculate a confidence score for each
2-2. Classification for each cell (not for each bounding box)
3. Pick bounding boxes through NMS (I sketched my rough understanding of NMS below).
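To check that I understand step 3, here is a rough NMS sketch I wrote myself; the `iou`/`nms` helper names and the 0.5 overlap threshold are my own choices, not from any particular YOLO implementation.

```python
import numpy as np

def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2); returns intersection-over-union.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_threshold=0.5):
    # Keep the highest-scoring box, drop boxes that overlap it too much, repeat.
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```

For example, `nms([(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)], [0.9, 0.8, 0.7])` keeps indices 0 and 2 and drops the heavily overlapping box 1.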
And the algorithm is named YOLO (You Only Look Once) because it does the bounding box drawing (2-1) and the classification (2-2) at once, instead of doing them in separate steps.
Then, here is my question:
It seems that a bounding box can be drawn outside its cell's bounds, so does that mean the algorithm needs to look at the picture (number of cells) * B times?
I thought we only needed each cell's own data for steps 2-1 and 2-2 of that cell, but it seems not, because drawing the bounding box needs data from outside the cell's bounds.
Please let me know if my understanding is correct.
Here are some thoughts related to your questions and assertions. Hope it helps.
YOLO doesn’t draw bounding boxes: it predicts bounding box center location and shape. Whether they end up being visualized or not depends on what the application is being used for. If it’s driving an autonomous vehicle, for example, likely not.
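As a minimal sketch of that distinction (the function and variable names are mine, and I'm assuming v1-style conventions: x, y as offsets within a cell and w, h as fractions of the whole image): the network only outputs numbers describing center and shape, and turning them into corner coordinates for visualization is a separate post-processing step.

```python
def decode_box(tx, ty, tw, th, row, col, S, img_w, img_h):
    # tx, ty: predicted center offsets within grid cell (row, col), each in [0, 1].
    # tw, th: predicted width/height as fractions of the whole image (v1-style).
    cx = (col + tx) / S * img_w          # box center in image coordinates
    cy = (row + ty) / S * img_h
    w, h = tw * img_w, th * img_h
    # Corner coordinates, e.g. for drawing; note w and h can exceed the cell size.
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```

For instance, `decode_box(0.5, 0.5, 0.6, 0.4, row=3, col=3, S=7, img_w=448, img_h=448)` gives a box centered at (224, 224) of size 268.8 x 179.2, even though the prediction "belongs" to one grid cell.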
It is common to hear or read that YOLO divides the input image into grid cells. This is not precisely correct. The image isn't divided at all, which is where the name comes from… you only look at (input) the image once. The entire image is input to the CNN and processed through the forward propagation exactly once. What is divided is the ground truth training data $Y$ and the output predictions $\hat{Y}$.
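To make that concrete, here is a sketch with assumed v1-ish shapes (S=7, B=2, C=20, 448 x 448 input; the exact numbers vary by version). The image tensor stays whole; only the label/prediction tensor has the grid structure.

```python
import numpy as np

S, B, C = 7, 2, 20                    # grid size, boxes per cell, classes (v1-ish assumptions)
image = np.zeros((448, 448, 3))       # the whole image goes through the CNN once, undivided
Y_hat = np.zeros((S, S, B * 5 + C))   # the S x S grid structure lives in the labels/predictions
print(image.shape, Y_hat.shape)       # (448, 448, 3) (7, 7, 30)
```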
One needs to be careful making assertions about what YOLO does or how it works, because there are many versions out there and they don’t all work exactly the same way. I think v1 did make one class prediction for all the predicted bounding boxes in a given grid cell, whereas with v2 there was one class prediction for each predicted bounding box in a grid cell.
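A shape-level way to see that difference (my own simplification, reusing the same S, B, C values and ignoring anchors and other per-version details):

```python
S, B, C = 7, 2, 20

# v1-style cell: the B boxes share one set of C class scores.
v1_cell = B * 5 + C       # 2*5 + 20 = 30 numbers per cell, one class prediction per cell
# v2-style cell: each box carries its own class scores.
v2_cell = B * (5 + C)     # 2*(5 + 20) = 50 numbers per cell, one class prediction per box

print(v1_cell, v2_cell)   # 30 50
```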
It is correct that a predicted bounding box can be larger than a grid cell. There are threads already in the forum that describe the math of how and why. It is important to connect this idea with the fact that the entire image is the source of information for each bounding box and classification prediction, not just a grid cell shaped subregion of it. Also, again, it is object center location and shape that are being predicted by the YOLO neural net; bounding boxes are not drawn by it.
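As a quick worked example of how a predicted box can dwarf a grid cell (assuming v1-style normalization where predicted w and h are fractions of the whole image; the numbers are made up):

```python
S, img_w, img_h = 7, 448, 448
cell_w, cell_h = img_w / S, img_h / S   # each grid cell covers 64 x 64 pixels
tw, th = 0.5, 0.4                       # made-up predicted width/height, as image fractions
box_w, box_h = tw * img_w, th * img_h
print(cell_w, cell_h, box_w, box_h)     # 64.0 64.0 224.0 179.2 (much larger than one cell)
```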
This related thread has more background. You can find others like it using Search.
So, YOLO allows us to produce S * S * B box predictions in a single forward propagation by adding dimensions to the output (with a corresponding model change, of course), and we can also add a dimension of size C for class prediction if we need it, again in the same forward propagation;
therefore we can do multiple classifications and bounding box predictions in a single forward propagation, which is the meaning of “You Only Look Once”.
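Putting rough numbers on that (again assuming v1-style S=7, B=2, C=20, just to make the "single forward propagation" point concrete):

```python
S, B, C = 7, 2, 20
boxes_per_pass = S * S * B                # 98 box predictions from one forward propagation
numbers_per_pass = S * S * (B * 5 + C)    # 1470 values in the single output tensor
print(boxes_per_pass, numbers_per_pass)   # 98 1470
```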
At the time YOLO was published, 2015-ish, object detection state of the art was coming from sliding-window and region-based algorithms. They could produce multiple numeric outputs (predictions) per forward propagation, handling both classification and location for a single object. But forward propagation needed to be run multiple times to cover multiple objects distributed throughout the input image. As a result, those algorithms just weren’t fast enough to enable autonomous vehicles. The innovation of YOLO was to detect (classify + localize) multiple objects from the entire image in a single pass. It was competitive on accuracy with then-current algorithms but significantly faster.