Week 3: finding the correct cell in YOLO

Doron_Modan · January 5, 2023, 9:32pm

There is something I don’t understand in YOLO.
What if the car image is present in multiple grid cells, like so?
How can we assign the image to any specific cell and how do we create the car’s bounding box?

paulinpaloalto · January 5, 2023, 11:11pm

The way that is handled is that the object will be assigned to the cell that contains the centroid of the object. The bounding boxes are created by the algorithm based on the training, just as the ability to figure out which cell to assign the object to or even to recognize an object in the first place is also learned through training. I totally agree that it seems like magic, but it works. Of course training YOLO is a very big deal, requiring a lot of labelled data and a lot of compute. It’s been a while since I watched those lectures, but I’m pretty sure Prof Ng covered everything I just said in the lectures. If you have more doubts, it might be worth “holding that thought” until you’ve watched all the lectures on Object Detection.

ai_curious · January 6, 2023, 7:48am

Here’s my version:

First, it’s important to distinguish whether we are talking about building the training data or about the output at the end of a forward pass of the YOLO network.

If the former, the correct grid cell for the centroid of the object is easily determined mathematically, since the label must provide ground truth bounding box coordinates and we know the image dimensions and the grid size. In the attached image, one would be at i=0, j=2 and one would be at i=3, j=1. As @paulinpaloalto mentions, these ground truth values then drive learning during training iterations as the network attempts to reproduce those outputs. In other words, you create a y of the same shape as the network output you need, assign it values known from the ground truth labels, and use a cost function that penalizes the difference between y and \hat{y}.
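As a minimal sketch of that calculation, the snippet below maps a ground truth box to the grid cell containing its center. The function name `centroid_cell`, the corner-style box format (x_min, y_min, x_max, y_max) in pixels, and the (column, row) return convention are all assumptions made for illustration; they are not taken from the course code.

```python
def centroid_cell(box, image_w, image_h, grid_cols, grid_rows):
    """Return the (col, row) of the grid cell containing the box center.

    Hypothetical helper: assumes box = (x_min, y_min, x_max, y_max) in pixels,
    with the origin at the top-left corner of the image.
    """
    x_min, y_min, x_max, y_max = box
    # Center of the ground truth bounding box.
    cx = (x_min + x_max) / 2.0
    cy = (y_min + y_max) / 2.0
    # Scale to grid units and truncate; clamp so a center lying exactly on
    # the right or bottom edge still falls in the last cell.
    col = min(int(cx / image_w * grid_cols), grid_cols - 1)
    row = min(int(cy / image_h * grid_rows), grid_rows - 1)
    return col, row


# A 500x400 image split into 5 columns and 4 rows (100x100 pixel cells).
# This box overlaps six cells, but its center (270, 190) lies in only one.
print(centroid_cell((150, 120, 390, 260), 500, 400, 5, 4))  # prints (2, 1)
```

The box overlaps several cells, yet only the cell holding the center gets the object in the training target y; every other cell it touches is labelled as empty for that object.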

Notice, however, that there is no assignment of grid cell locations (nor any other output like bounding box shape or object class) at the end of a forward pass of the network, that is, in the predicted outputs \hat{y}. Rather, the network just outputs what it outputs. In this example, it would be 20 sets of predictions since it is a 5x4 grid (ignoring anchor boxes, since they aren’t depicted here). Hopefully the training has gone well and 18 of the predictions are empty or at least very low confidence, and two will be non-empty and high confidence. Those two will be the network output locations corresponding to the two grid cells that were non-empty in the training data, i=0, j=2 and i=3, j=1.
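To make the prediction side concrete, here is a small sketch that scans a grid of per-cell object confidences and keeps the cells above a cutoff. The confidence values are made up for illustration, the 0.5 threshold is an arbitrary assumption, and `confident_cells` is a hypothetical helper; a real pipeline would also decode box coordinates and apply non-max suppression.

```python
def confident_cells(confidence, threshold=0.5):
    """Return (row, col) for each grid cell whose confidence meets the threshold.

    Hypothetical helper: `confidence` is a list of rows, one value per cell.
    """
    return [
        (r, c)
        for r, row in enumerate(confidence)
        for c, p in enumerate(row)
        if p >= threshold
    ]


# Made-up network output for a grid with 4 rows and 5 columns (20 cells).
confidence = [
    [0.02, 0.01, 0.03, 0.01, 0.02],
    [0.01, 0.04, 0.02, 0.91, 0.03],
    [0.88, 0.02, 0.01, 0.02, 0.01],
    [0.01, 0.03, 0.02, 0.01, 0.02],
]

print(confident_cells(confidence))  # prints [(1, 3), (2, 0)]
```

Nothing here "assigns" an object to a cell. All 20 cells produce a prediction, and a well-trained network simply yields high confidence at the two cells that were non-empty in the labels.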

Training and predicting the anchor box, bounding box shape, object confidence and object class values all work exactly the same way, which is part of the elegance of the YOLO idea. Hope this helps.

Doron_Modan · January 6, 2023, 9:21pm

Thank you @paulinpaloalto and @ai_curious for both of your answers. Much clearer now!
