There is something I don’t understand in YOLO.
What if the car image was present in multiple grid cells, like so?
How can we assign the image to any specific cell and how do we create the car’s bounding box?

The way that is handled is that the object is assigned to the grid cell that contains the centroid of the object. The bounding boxes are created by the algorithm based on its training; the ability to figure out which cell an object belongs to, or even to recognize an object in the first place, is learned through training in the same way. I totally agree that it seems like magic, but it works. Of course, training YOLO is a very big deal, requiring a lot of labelled data and a lot of compute. It’s been a while since I watched those lectures, but I’m pretty sure Prof Ng covered everything I just said. If you have more doubts, it might be worth “holding that thought” until you’ve watched all the lectures on Object Detection.
Here’s my version:
First, it’s important to distinguish whether we are talking about the creation of the training data, or about the end of a forward pass of the YOLO network.
If the former, the correct grid cell for the centroid of the object is easily determined mathematically, since the label must provide ground truth bounding box coordinates and we know the image dimensions and the grid size. In the attached image, one object would be at i=0, j=2 and one would be at i=3, j=1. As @paulinpaloalto mentions, these ground truth values then drive learning during training iterations as the network attempts to reproduce those outputs. In other words, you create a y of the same shape as the network output you need, assign it values known from the ground truth labels, and use a cost function that minimizes the difference between y and \hat{y}.
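To make the "determined mathematically" part concrete, here is a minimal sketch of the centroid-to-cell computation, assuming box centers are given in normalized [0, 1] image coordinates and the grid has a known number of rows and columns (the function name and argument names are illustrative, not from any particular YOLO implementation):

```python
# Hedged sketch: mapping a ground-truth box centroid to its grid cell,
# assuming normalized [0, 1] coordinates and a rows x cols grid.
def centroid_to_cell(cx, cy, grid_rows, grid_cols):
    """Return (i, j) of the grid cell containing the centroid (cx, cy).

    cx runs left -> right, cy runs top -> bottom, both in [0, 1].
    """
    i = min(int(cy * grid_rows), grid_rows - 1)  # row index
    j = min(int(cx * grid_cols), grid_cols - 1)  # column index
    return i, j

# Illustrative centroids on a 5x4 grid (5 rows, 4 columns):
print(centroid_to_cell(0.55, 0.05, 5, 4))  # near the top    -> (0, 2)
print(centroid_to_cell(0.30, 0.70, 5, 4))  # lower-left area -> (3, 1)
```

The `min(..., grid_rows - 1)` clamp just handles the edge case of a centroid sitting exactly on the image boundary.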
Notice, however, that there is no assignment of grid cell locations (nor of any other output like bounding box shape or object class) at the end of a forward pass of the network, i.e. in the predicted outputs \hat{y}. Rather, the network just outputs what it outputs. In this example, that would be 20 sets of predictions, since it is a 5x4 grid (ignoring anchor boxes, since they aren’t depicted here). Hopefully the training has gone well, so that 18 of the predictions are empty or at least very low confidence, and two are non-empty and high confidence. Those two will be the network output locations corresponding to the two grid cells that were non-empty in the training data, i=0, j=2 and i=3, j=1.
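The "the network just outputs what it outputs" point can be sketched in a few lines: every grid cell emits a prediction vector, and the caller simply filters by confidence afterwards. The tuple layout `(confidence, bx, by, bw, bh, class_id)`, the threshold, and the specific numbers below are all illustrative assumptions, not actual YOLO output:

```python
# Hedged sketch: at inference time, one prediction vector per grid cell;
# nothing "assigns" objects to cells. We just keep high-confidence cells.
# Prediction layout (confidence, bx, by, bw, bh, class_id) is assumed.
def keep_confident(predictions, threshold=0.5):
    return [(cell, p) for cell, p in predictions if p[0] >= threshold]

# A 5x4 grid -> 20 predictions; suppose only two cells fire strongly.
grid = {(i, j): (0.02, 0, 0, 0, 0, 0) for i in range(5) for j in range(4)}
grid[(0, 2)] = (0.91, 0.55, 0.05, 0.20, 0.15, 2)   # hypothetical car
grid[(3, 1)] = (0.88, 0.30, 0.70, 0.25, 0.18, 2)   # hypothetical car
kept = keep_confident(sorted(grid.items()))
print([cell for cell, _ in kept])  # -> [(0, 2), (3, 1)]
```

The other 18 cells still produced output; they were simply low-confidence and got filtered, which is the behavior a well-trained network should exhibit here.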
Training and predicting the anchor box, bounding box shape, object confidence and object class values all work exactly the same way, which is part of the elegance of the YOLO idea. Hope this helps.
Thank you @paulinpaloalto and @ai_curious for both of your answers. Much clearer now!
I have seen Andrew’s lectures many times over and am still struggling to understand all the details of YOLO, mostly regarding what happens when an object spans more than one grid cell. The only answer I find repeatedly given to this question is the canned line “only one grid cell is responsible for predicting the object” or “the object is assigned to only one grid cell,” without any detail as to how this is actually accomplished. E.g., regarding the original question and the picture above, how can @ai_curious claim that “Hopefully the training has gone well and 18 of the predictions are empty or at least very low confidence, and two will be non-empty and high confidence,” whereas looking at the picture it seems clear that more than one grid cell will predict the cars with high probability? When we force YOLO to stop predicting the object in cells to which it was not “assigned,” aren’t we training it to ignore the major parts of the object that fall in non-assigned grid cells and to use only the part that falls in the “assigned” cell? Wouldn’t that confuse the network and ultimately degrade its prediction ability?
This thread has been cold for three years.
Maybe start a new one.
In the example above, there will be 2 grid cells with non-empty training data and 18 with empty. The grid cells that contain object centers have a ground truth bounding box that includes a perimeter around the entire object, not just the parts of the object within the grid cell that contains its center. The grid cells that don’t contain an object’s center have no information about any objects. During training, incorrect grid cell center and bounding box shape predictions are penalized by the cost function. If training is successful, the network learns to predict objects in the locations and grid cells where they actually are, and to predict nothing in locations where they are not. Sometimes the training goes well but mistakes are still made at prediction time. Non-max suppression attempts to filter out, or suppress, the lower-confidence, i.e. non-max, false positives.
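Since non-max suppression comes up here, a short sketch may help: among overlapping predicted boxes, keep the highest-confidence one and suppress the rest. This is a generic greedy NMS, not the exact code from the course; boxes are assumed to be `(x1, y1, x2, y2)` corner coordinates and the IoU threshold is illustrative:

```python
# Hedged sketch of greedy non-max suppression over (x1, y1, x2, y2) boxes.
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of boxes kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda k: scores[k], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest remaining confidence wins
        keep.append(best)
        # suppress anything overlapping the winner too much
        order = [k for k in order if iou(boxes[best], boxes[k]) < iou_threshold]
    return keep

boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.6, 0.8]
print(non_max_suppression(boxes, scores))  # -> [0, 2]; box 1 is suppressed
```

Box 1 overlaps box 0 heavily (IoU around 0.82) but has a lower score, so it is suppressed; box 2 is far away and survives.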
Does this help?
The mechanism is described above and elsewhere in related threads.
the correct grid cell for the centroid of the object is easily determined mathematically since the label must provide ground truth bounding box coordinates and we know the image dimensions and the grid size. In the attached image, one would be at i=0, j=2 and one would be at i=3, j=1
Assignment of the object center (as well as the ground truth bounding box shape) during creation of the training data turns out to be one of the simplest tasks of the YOLO algorithm.
Also mentioned above, during runtime, objects are not assigned to grid cells. The network just takes its input and outputs whatever it predicts.