Hello, while studying the YOLO algorithm, I have a question about it.
Based on what I understand, the YOLO algorithm proceeds in 3 steps as below:
1. S x S grid on input
2-1. Draw B bounding boxes for each cell and calculate a confidence score for each
2-2. Classification for each cell (not for each bounding box)
3. Pick bounding boxes through NMS (I sketched my rough understanding of NMS below).
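To check that I understand step 3, here is a rough NMS sketch I wrote myself; the `iou`/`nms` helper names and the 0.5 overlap threshold are my own choices, not from any particular YOLO implementation.

```python
import numpy as np

def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2); returns intersection-over-union.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_threshold=0.5):
    # Keep the highest-scoring box, drop boxes that overlap it too much, repeat.
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```

For example, `nms([(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)], [0.9, 0.8, 0.7])` keeps indices 0 and 2 and drops the heavily overlapping box 1.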
And the algorithm is named YOLO (You Only Look Once) because it does the bounding box drawing (2-1) and the classification (2-2) at once, instead of doing them in separate steps.
Then, here is my question:
It seems that a bounding box can be drawn outside its cell's bounds, so does that mean the algorithm needs to look at the picture (number of cells) * B times?
I thought we only needed each cell's own data for steps 2-1 and 2-2 of that cell, but it seems not, because drawing the bounding box needs data from outside the cell's bounds.
Please let me know if my understanding is correct.
Here are some thoughts related to your questions and assertions. Hope it helps.
YOLO doesn’t draw bounding boxes: it predicts bounding box center location and shape. Whether they end up being visualized or not depends on what the application is being used for. If it’s driving an autonomous vehicle, for example, likely not.
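As a minimal sketch of that distinction (the function and variable names are mine, and I'm assuming v1-style conventions: x, y as offsets within a cell and w, h as fractions of the whole image): the network only outputs numbers describing center and shape, and turning them into corner coordinates for visualization is a separate post-processing step.

```python
def decode_box(tx, ty, tw, th, row, col, S, img_w, img_h):
    # tx, ty: predicted center offsets within grid cell (row, col), each in [0, 1].
    # tw, th: predicted width/height as fractions of the whole image (v1-style).
    cx = (col + tx) / S * img_w          # box center in image coordinates
    cy = (row + ty) / S * img_h
    w, h = tw * img_w, th * img_h
    # Corner coordinates, e.g. for drawing; note w and h can exceed the cell size.
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```

For instance, `decode_box(0.5, 0.5, 0.6, 0.4, row=3, col=3, S=7, img_w=448, img_h=448)` gives a box centered at (224, 224) of size 268.8 x 179.2, even though the prediction "belongs" to one grid cell.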
It is common to hear or read that YOLO divides the input image into grid cells. This is not precisely correct. The image isn't divided at all, which is where the name comes from… you only look at (input) the image once. The entire image is input to the CNN and processed through the forward propagation exactly once. What is divided is the ground truth training data $Y$ and the output predictions $\hat{Y}$.
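To make that concrete, here is a sketch with assumed v1-ish shapes (S=7, B=2, C=20, 448 x 448 input; the exact numbers vary by version). The image tensor stays whole; only the label/prediction tensor has the grid structure.

```python
import numpy as np

S, B, C = 7, 2, 20                    # grid size, boxes per cell, classes (v1-ish assumptions)
image = np.zeros((448, 448, 3))       # the whole image goes through the CNN once, undivided
Y_hat = np.zeros((S, S, B * 5 + C))   # the S x S grid structure lives in the labels/predictions
print(image.shape, Y_hat.shape)       # (448, 448, 3) (7, 7, 30)
```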
One needs to be careful making assertions about what YOLO does or how it works, because there are many versions out there and they don’t all work exactly the same way. I think v1 did make one class prediction for all the predicted bounding boxes in a given grid cell, whereas with v2 there was one class prediction for each predicted bounding box in a grid cell.
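A shape-level way to see that difference (my own simplification, reusing the same S, B, C values and ignoring anchors and other per-version details):

```python
S, B, C = 7, 2, 20

# v1-style cell: the B boxes share one set of C class scores.
v1_cell = B * 5 + C       # 2*5 + 20 = 30 numbers per cell, one class prediction per cell
# v2-style cell: each box carries its own class scores.
v2_cell = B * (5 + C)     # 2*(5 + 20) = 50 numbers per cell, one class prediction per box

print(v1_cell, v2_cell)   # 30 50
```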
It is correct that a predicted bounding box can be larger than a grid cell. There are threads already in the forum that describe the math of how and why. It is important to connect this idea with the fact that the entire image is the source of information for each bounding box and classification prediction, not just a grid cell shaped subregion of it. Also, again, it is object center location and shape that are being predicted by the YOLO neural net; bounding boxes are not drawn by it.
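As a quick worked example of how a predicted box can dwarf a grid cell (assuming v1-style normalization where predicted w and h are fractions of the whole image; the numbers are made up):

```python
S, img_w, img_h = 7, 448, 448
cell_w, cell_h = img_w / S, img_h / S   # each grid cell covers 64 x 64 pixels
tw, th = 0.5, 0.4                       # made-up predicted width/height, as image fractions
box_w, box_h = tw * img_w, th * img_h
print(cell_w, cell_h, box_w, box_h)     # 64.0 64.0 224.0 179.2 (much larger than one cell)
```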
This related thread has more background. You can find others like it using Search.
So, YOLO allows us to produce S * S * B box predictions in a single forward propagation by adding dimensions to the output (with a corresponding model change, of course), and we can also add a dimension of size C for class prediction if we need it, again in the same forward propagation;
therefore we can do multiple classifications and bounding box predictions in a single forward propagation, which is the meaning of “You Only Look Once”.
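Putting rough numbers on that (again assuming v1-style S=7, B=2, C=20, just to make the "single forward propagation" point concrete):

```python
S, B, C = 7, 2, 20
boxes_per_pass = S * S * B                # 98 box predictions from one forward propagation
numbers_per_pass = S * S * (B * 5 + C)    # 1470 values in the single output tensor
print(boxes_per_pass, numbers_per_pass)   # 98 1470
```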
At the time YOLO was published, 2015-ish, object detection state of the art was coming from sliding-window and region-based algorithms. They could produce multiple numeric outputs (predictions) per forward propagation, handling both classification and location for a single object. But forward propagation needed to be run multiple times to cover multiple objects distributed throughout the input image. As a result, those algorithms just weren’t fast enough to enable autonomous vehicles. The innovation of YOLO was to detect (classify + localize) multiple objects from the entire image in a single pass. It was competitive on accuracy with then-current algorithms but significantly faster.