[C4W3] YOLO grid question

In the videos, Andrew uses a 3x3 grid over the car image (the one with the mountain road and snow). With 3x3 the intuition works quite well because the car fits in a cell, but then he says that in practice a finer grid, such as 19x19, is used. If we overlay a 19x19 grid, each grid cell will contain only a tiny portion of the car. How can the network predict an accurate bounding box? It seems it could only report the grid cell in which it detected the car as the bounding box, yet the car spans many grid cells and each cell contains only a small part of the object.

It is my understanding that for each grid cell the network will produce an output vector. Each output vector also encodes a bounding box, which in this case will contain only a small portion of the object.

Thanks!

Maybe take a look at this previous post, and see if it addresses your question?

That may be a lot to digest if it’s your first time really digging into this algorithm, but it shows the equations YOLO uses to relate predictions to anchor box and grid cell sizes. There is no other way to really understand it, in my opinion.

The TL;DR is that every network output location, indexed by (m, S, S, B), holds a vector of (1 + 4 + C) predictions computed from the entire image, so these predictions are not constrained to the specific grid cell or anchor box shape they represent. The grid cell only determines which output location is responsible for an object (the cell containing the object's center); the predicted width and height can span many cells. This is one of the key aspects that differentiates YOLO from sliding windows and other region-based approaches.
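To make that concrete, here is a minimal sketch of how one raw prediction could be decoded into a box, assuming the YOLOv2-style parameterization (sigmoid offsets for the center, exponential scaling of an anchor for the size). The function name `decode_box`, the anchor size, and the raw values are made up for illustration, and the course's own implementation may differ in details.

```python
import math


def sigmoid(t):
    """Squash a raw value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))


def decode_box(t_x, t_y, t_w, t_h, col, row, anchor_w, anchor_h, S=19):
    """Decode one raw prediction into (b_x, b_y, b_w, b_h) as fractions of the image.

    Assumes the YOLOv2-style equations, with anchors given in grid-cell units:
        b_x = sigmoid(t_x) + col      b_w = anchor_w * exp(t_w)
        b_y = sigmoid(t_y) + row      b_h = anchor_h * exp(t_h)
    Dividing by S converts grid-cell units to image-relative units.
    """
    b_x = (col + sigmoid(t_x)) / S   # center stays inside cell (col, row)
    b_y = (row + sigmoid(t_y)) / S
    b_w = anchor_w * math.exp(t_w) / S   # size is NOT limited to one cell
    b_h = anchor_h * math.exp(t_h) / S
    return b_x, b_y, b_w, b_h


# Hypothetical example: the car's center falls in cell (9, 9) of a 19x19 grid,
# and the matching anchor is 5 cells wide and 3 cells tall.
box = decode_box(0.0, 0.0, 0.2, 0.1, col=9, row=9, anchor_w=5.0, anchor_h=3.0)
b_x, b_y, b_w, b_h = box

print("center (image fraction):", b_x, b_y)
print("width in grid cells:", b_w * 19)
print("height in grid cells:", b_h * 19)
```

Only the box center is tied to the cell; the width and height come out several cells wide here, which is how a 19x19 grid can still describe a car that covers many cells.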
