I understand that each cell in the 19x19 grid is responsible for making one prediction per anchor box, giving a y vector for each object whose center lies in the cell.
What I do not understand is:
i) Andrew mentions how object localization and classification are performed. Does this mean each grid cell performs its own convolutional feed-forward pass to give y?
ii) Does this conv feed-forward for localization and classification per grid cell take just the grid cell region as input, or the whole image? If the former, isn’t each grid cell actually very small, possibly not containing enough of the object to tell what the object is or whether it exists?
In short:
- How does the algorithm predict bounding boxes that are larger than the grid cell?
- How does the algorithm know in which cell the center of the object is located?
At training time the grid cell in which the object center occurs is easy to determine from the grid dimensions and the ground truth bounding box. At runtime each cell makes a prediction based on its training and either predicts an object is there or not, but no grid cell knows whether it contains the object center or not. They all just do what they were trained to do, all at the same time.
In YOLO, unlike sliding windows or some other approaches, the entire image is used as input to the CNN forward pass. A bounding box prediction is therefore not limited by the grid cell size.
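To make this concrete, here is a minimal sketch of how one cell’s raw outputs become an image-space box, using a YOLOv2-style parameterization (exact formulas vary across YOLO versions). The grid size, input resolution, anchor dimensions, and raw output values below are all made-up numbers for illustration:

```python
import numpy as np

S = 19            # grid size (19x19, as in the course)
IMG = 608         # assumed input resolution, so each cell spans 608/19 = 32 px
CELL = IMG / S

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(row, col, tx, ty, tw, th, anchor_w, anchor_h):
    """Turn one cell/anchor's raw outputs (tx, ty, tw, th) into an image-space box."""
    # The sigmoid keeps the predicted center inside the responsible cell...
    bx = (col + sigmoid(tx)) * CELL
    by = (row + sigmoid(ty)) * CELL
    # ...but width/height rescale an anchor prior and can span many cells.
    bw = anchor_w * np.exp(tw)
    bh = anchor_h * np.exp(th)
    return bx, by, bw, bh

# A predicted box of roughly 204x146 px, far larger than one 32-px cell:
print(decode_box(row=9, col=9, tx=0.0, ty=0.0, tw=1.0, th=0.6,
                 anchor_w=75, anchor_h=80))
```

The center coordinates are anchored to the cell, but nothing constrains the width and height to the cell’s extent, which is why one prediction can cover a large object.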
@ai_curious Thanks! This gives me a fair bit of clarity; I even looked up the model architecture online and it makes sense. What threw me off was the course’s emphasis on each grid cell performing object localization, which made me imagine something completely different.
So does this mean that the grid structure is purely for training and for us, assigning the center to a grid cell and defining bounding boxes to allow the YOLO architecture to learn it, but the model itself is not dividing the image into an s by s grid; it’s just convolution reducing the image to an s by s output?
Remember YOLO was invented to solve problems with the solutions that existed in 2016. One was how to detect more than one object in an image. If you try to do it by dividing up the image and running the neural network on each piece separately, you can detect multiple objects, but at high computational cost and with issues of object/window alignment and size. Changing the network output size/shape and using the entire image as the single input addresses both: you get multiple object detections from a single forward pass without incurring the runtime cost of the ‘windows’ approaches.
The grid is thus important for both training and runtime, but not exactly in the same way. In training, it is a preprocessing step to set up the training data: a 1 for object presence in the grid cell where the object center falls, and 0 for all others. This is computed using the image size, the ground truth bounding box positions, and the grid cell size. Training then penalizes predictions that get it wrong. At runtime, the grid shape drives the neural network output shape, and hence the number of predictions made simultaneously from a single forward pass.
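As a rough sketch of that preprocessing step (the 608×608 input size and the helper name are assumptions for illustration, not the course’s exact code):

```python
import numpy as np

S = 19        # grid size
IMG = 608     # assumed input resolution (608/19 = 32 px per cell)

def make_objectness_target(gt_boxes):
    """gt_boxes: list of (x_min, y_min, x_max, y_max) in pixels.
    Returns an S x S map with 1 in the cell that holds each box center."""
    target = np.zeros((S, S))
    for x1, y1, x2, y2 in gt_boxes:
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2   # ground truth box center
        col = int(cx * S / IMG)                  # cell column for that center
        row = int(cy * S / IMG)                  # cell row for that center
        target[row, col] = 1.0                   # this cell 'owns' the object
    return target

# One ground truth box; exactly one of the 361 cells gets a 1.
t = make_objectness_target([(100, 80, 400, 230)])
print(t.sum(), np.argwhere(t == 1))  # -> 1.0 [[4 7]]
```

The real training target also stores box coordinates, class label, and anchor assignment in that cell, but the objectness bit above is the part that answers “which cell owns the object.”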
Thanks, I think I’ve got it!
Answer for: *How does the algorithm know in which cell the center of the object is located?*
The center of the object is learned end to end, and the prediction of the center is based entirely on the training data.
Am I correct?
This is correct, but it is important to see how that happens, which is described in one of the replies above in this thread. At training time you know the ground truth bounding box location, the training image dimensions, and the number of grid cells. From this, you can calculate which pixel is the ground truth bounding box center, and map that to one specific grid cell. That grid cell is then given a 1 for object presence, and all the other grid cells a 0. Grid cell center location / object presence is included as one component of the cost function, so the network ‘learns’ to mimic the manual assignment. At runtime, it just makes predictions based on the input signal and the learned parameters of the neural net.
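For example, assuming a 608×608 input with a 19×19 grid (so each cell spans 32 pixels), a ground truth center at pixel (250, 130) maps to row floor(130/32) = 4 and column floor(250/32) = 7. That one cell gets objectness 1; the remaining 360 cells get 0.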
For objects in the middle of, and wholly contained by, the area corresponding to one grid cell, the center is predicted well. For objects that straddle grid cell regions and/or are bigger than a single cell, you may have multiple predictions for the same object. Non-max suppression comes into play to disambiguate them.
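A minimal sketch of that suppression step (plain greedy NMS with a made-up IoU threshold; YOLO implementations typically run it per class):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Keep the highest-scoring box, drop overlapping duplicates, repeat."""
    order = list(np.argsort(scores)[::-1])   # indices, best score first
    keep = []
    while order:
        best = order.pop(0)
        keep.append(int(best))
        # Drop every remaining box that overlaps the winner too heavily.
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

# Two cells predicting the same object, plus one separate object:
boxes = [(100, 80, 400, 230), (110, 85, 405, 240), (500, 500, 560, 590)]
scores = [0.9, 0.75, 0.8]
print(nms(boxes, scores))  # -> [0, 2]: the duplicate (index 1) is suppressed
```

So even when several grid cells fire on the same large object, only the most confident prediction survives.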