There is something I don’t understand in YOLO.
What if the car image is present in multiple grid cells, like so?
How can we assign the car to one specific cell, and how do we create the car’s bounding box?
The object is assigned to the cell that contains the centroid of the object. The bounding boxes are produced by the algorithm based on its training, just as the ability to figure out which cell to assign the object to, or even to recognize an object in the first place, is also learned through training. I totally agree that it seems like magic, but it works. Of course training YOLO is a very big deal, requiring a lot of labelled data and a lot of compute. It’s been a while since I watched those lectures, but I’m pretty sure Prof Ng covered everything I just said in the lectures. If you have more doubts, it might be worth “holding that thought” until you’ve watched all the lectures on Object Detection.
Here’s my version:
First, it’s important to distinguish whether we are talking about building the training data or about the output at the end of a forward pass of the YOLO network.
If the former, the correct grid cell for the centroid of the object is easily determined mathematically, since the label must provide ground truth bounding box coordinates and we know the image dimensions and the grid size. In the attached image, one would be at i=0, j=2 and one would be at i=3, j=1. As @paulinpaloalto mentions, these ground truth values then drive learning during training iterations as the network attempts to reproduce those outputs. In other words, you create a y of the same shape as the network output you need, assign it values known from the ground truth labels, and use a cost function that penalizes the difference between y and \hat{y}.
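To make the "easily determined mathematically" part concrete, here is a minimal sketch of the centroid-to-cell calculation. The function name `centroid_cell`, the 500x400 image size, the box coordinates, and the reading of i as the row index and j as the column index are all my own assumptions for illustration; none of them come from the lectures or from any YOLO codebase.

```python
def centroid_cell(box, image_w, image_h, grid_rows, grid_cols):
    """Return (row, col, x_offset, y_offset) for the cell holding the box centroid.

    box is (x_min, y_min, x_max, y_max) in pixels (hypothetical label format).
    The offsets are the centroid position inside its cell, each in [0, 1).
    """
    x_min, y_min, x_max, y_max = box
    cx = (x_min + x_max) / 2.0
    cy = (y_min + y_max) / 2.0

    # Centroid position measured in grid-cell units.
    gx = cx * grid_cols / image_w
    gy = cy * grid_rows / image_h

    # Clamp so a centroid on the far right/bottom edge stays inside the grid.
    col = min(int(gx), grid_cols - 1)
    row = min(int(gy), grid_rows - 1)
    return row, col, gx - col, gy - row


# Assumed example: a 500x400 image with 4 rows and 5 columns of cells.
# The box spans several cells, but its centroid (260, 90) lies in only one.
row, col, bx, by = centroid_cell((140, 20, 380, 160), 500, 400, 4, 5)
print(row, col)
```

With those made-up numbers the car overlaps cells in two rows and four columns, yet only the cell at row 0, column 2 gets the label, because that is where the centroid falls.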
Notice, however, that there is no assignment of grid cell locations (nor of any other output like bounding box shape or object class) at the end of a forward pass of the network, that is, in the predicted outputs \hat{y}. Rather, the network just outputs what it outputs. In this example, that would be 20 sets of predictions, since it is a 5x4 grid (ignoring anchor boxes, since they aren’t depicted here). Hopefully the training has gone well and 18 of the predictions are empty or at least very low confidence, and two are non-empty and high confidence. Those two will be the network output locations corresponding to the two grid cells that were non-empty in the training data, i=0, j=2 and i=3, j=1.
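Here is a rough sketch of what reading those 20 predictions might look like. The confidence values are invented, the 0.5 threshold is an arbitrary choice of mine, and `confident_cells` is a hypothetical helper, not part of any YOLO implementation. I am again treating i as the row and j as the column.

```python
def confident_cells(pred_conf, threshold=0.5):
    """Return (row, col, confidence) for every cell at or above the threshold.

    pred_conf is a nested list of per-cell object confidences, one value per
    grid cell (anchor boxes are ignored, as in the example above).
    """
    return [
        (r, c, p)
        for r, row in enumerate(pred_conf)
        for c, p in enumerate(row)
        if p >= threshold
    ]


# Invented output of a well-trained network on a grid of 4 rows x 5 columns:
# 18 cells are near zero, two cells are confident.
pred_conf = [
    [0.02, 0.01, 0.93, 0.03, 0.01],
    [0.01, 0.04, 0.05, 0.02, 0.01],
    [0.03, 0.02, 0.01, 0.01, 0.02],
    [0.01, 0.88, 0.02, 0.01, 0.03],
]

hits = confident_cells(pred_conf)
print(hits)
```

Nothing here "assigns" an object to a cell. The cell indices simply fall out of where the high-confidence values sit in the output, which is the point made above.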
Training and predicting the anchor box, bounding box shape, object confidence and object class values all work exactly the same way, which is part of the elegance of the YOLO idea. Hope this helps.
Thank you @paulinpaloalto and @ai_curious for both of your answers. Much clearer now!