I don't get why, with a 3x3 grid, it can still detect an object that overlaps two grid cells.


I don't get why the pedestrian there has a bounding box larger than a grid cell. I don't understand how the grid cells communicate with each other. Isn't the maximum of bh and bw only 1 per grid cell?

Also, please explain how the bx, by (center) point is used in the 3x3 grid. What is its relevance?

Hi @Zolids

The YOLO network predicts adjustments to predefined anchor boxes of varying dimensions, and its convolutional layers learn features across the entire image, which lets the network capture objects spanning multiple cells. Only the box center is tied to a single cell; bh and bw are not capped at 1. The bx and by values, normalized between 0 and 1, define the bounding box center relative to the top-left corner of its grid cell, which gives precise localization within the coarse grid cell divisions.
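
To make the grid part concrete, here is a minimal sketch of how a ground-truth box could be assigned to a cell of a 3x3 grid. The function name `assign_to_cell` and the example numbers are my own for illustration (not from any library), and I am assuming the box is given as fractions of the image size and that bw, bh are expressed in cell units.

```python
def assign_to_cell(cx, cy, w, h, grid=3):
    """Assign a box to the grid cell that contains its center.

    cx, cy, w, h are fractions of the image width/height (assumed convention).
    Returns (row, col, bx, by, bw, bh), where bx, by are offsets inside the
    cell in [0, 1) and bw, bh are measured in cell units.
    """
    # The cell is chosen by the center point only.
    col = min(int(cx * grid), grid - 1)
    row = min(int(cy * grid), grid - 1)
    # Center relative to the top-left corner of that cell.
    bx = cx * grid - col
    by = cy * grid - row
    # Size in cell units: nothing limits these to 1.
    bw = w * grid
    bh = h * grid
    return row, col, bx, by, bw, bh


# Hypothetical tall pedestrian: centered near the middle, 80% of image height.
row, col, bx, by, bw, bh = assign_to_cell(0.5, 0.55, 0.2, 0.8)
print(row, col, round(bx, 2), round(by, 2), round(bw, 2), round(bh, 2))
```

In this example the pedestrian belongs to the middle cell because its center falls there, yet bh comes out around 2.4 cells, so the box covers parts of all three rows. No communication between cells is needed for that.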

Hope it helps! Feel free to ask if you need further assistance.

+1 for calling out this really important point that isn’t always mentioned or emphasized. In YOLO v2, the predicted bounding box shape is the anchor box shape times a factor. The derived bounding box shape is used in the cost function, but the factor is what the network is learning to generate. This also reinforces why it is important to have a good set of anchor boxes; the closer the anchor box shapes are to the ground truth bounding box shapes, the faster and better the network learns the weights that produce better factors and lower localization error.
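
As a rough sketch of "anchor box shape times a factor", here is the YOLO v2 style decoding as I understand it from the paper: the center is a sigmoid offset added to the cell index, and the size is the anchor size scaled by an exponential of the raw output. The function name `decode_box` and the sample anchor are assumptions for illustration, and all values are in grid-cell units.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(tx, ty, tw, th, cell_col, cell_row, anchor_w, anchor_h):
    """Turn raw network outputs into a box, in grid-cell units.

    tx, ty, tw, th are the raw outputs for one anchor in one cell.
    anchor_w, anchor_h are the predefined anchor shape (hypothetical values below).
    """
    # Sigmoid keeps the center inside the responsible cell.
    bx = cell_col + sigmoid(tx)
    by = cell_row + sigmoid(ty)
    # exp() is the learned factor; it is positive but otherwise unbounded,
    # so the box can be larger than one cell.
    bw = anchor_w * math.exp(tw)
    bh = anchor_h * math.exp(th)
    return bx, by, bw, bh


# Tall, narrow anchor (assumed) in the middle cell of a 3x3 grid.
# With all raw outputs at 0 the factor is 1 and the box equals the anchor.
print(decode_box(0.0, 0.0, 0.0, 0.0, cell_col=1, cell_row=1, anchor_w=0.6, anchor_h=2.0))
```

If the anchor is already close to the ground truth shape, the network only needs to output tw and th near 0, which is the point about choosing good anchors.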

@Zolids there are many threads that go deep on the questions you pose above. Here is one that might be useful:

It contains the mathematical expressions for what @Alireza_Saei wrote and that I quoted above. From those equations you can see how a predicted bounding box shape can exceed the dimension of a grid cell.