How does YOLO know if 3 cells make 1 object?

YOLO seems like an interesting algorithm, but I haven’t been able to understand it well enough. Specifically, I don’t understand how an object is detected if it spans multiple grid cells. In the image below (it is a 5x5 grid; please ignore the smaller grid inside the cell), each cell predicts its own bounding box coordinates and class probabilities, so how would the individual cells know that they are part of the same object?


The grid cells and anchor boxes in YOLO don’t cooperate at all. Each grid cell + anchor box location, called a ‘detector’ in the original paper, makes a set of class and location predictions based on its training. The predictions are made in parallel, completely independently of one another.

In post-CNN processing, possible duplicate predictions made by multiple ‘detectors’ are disambiguated and filtered so that only the highest-confidence prediction is retained.
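The filtering step described above is usually implemented as non-max suppression (NMS). Here is a minimal pure-Python sketch of the idea, not the code of any particular YOLO release. The function names, the `(x1, y1, x2, y2)` box format, the 0.5 IoU threshold, and the sample predictions are all assumptions made for illustration, and the sketch ignores class labels (real pipelines typically run this per class).

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(predictions, iou_threshold=0.5):
    """Keep the highest-confidence box among heavily overlapping ones.

    predictions: list of (box, confidence) pairs (hypothetical format).
    """
    # Rank all predictions by confidence, best first.
    remaining = sorted(predictions, key=lambda p: p[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        # Drop every remaining box that overlaps the kept one too much.
        remaining = [p for p in remaining if iou(best[0], p[0]) < iou_threshold]
    return kept


# Made-up example: three detectors fire on the same object, one on another.
predictions = [
    ((10, 10, 110, 110), 0.90),
    ((12, 8, 108, 112), 0.75),
    ((15, 15, 115, 105), 0.60),
    ((200, 200, 260, 260), 0.80),
]
kept = non_max_suppression(predictions)
print(kept)
```

With these made-up inputs, the three overlapping boxes collapse to the 0.90 one and the separate 0.80 box survives. Note that nothing is merged: the lower-confidence duplicates are simply discarded.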

There is no information sharing across ‘detectors’ and no merging. If an object is spread across multiple grid cells, its center lies in only one of them, and that grid cell is the one that should make the prediction for the entire object, not just for the part of the object inside the cell. If several grid cells each make a prediction (again, for the entire object), the predictions are ranked by confidence and the lower-confidence ones are suppressed.
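To make the "center is only in one cell" point concrete, here is a small sketch that finds the cell containing an object's center and, for comparison, counts how many cells the object overlaps. The helper names, the pixel box format, the 500x500 image size, and the sample box are assumptions for illustration; only the 5x5 grid comes from the question.

```python
def responsible_cell(box, image_size, grid_size=5):
    """Return (row, col) of the grid cell that contains the box center.

    box: (x1, y1, x2, y2) in pixels; image_size: (width, height).
    """
    width, height = image_size
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    # Clamp so a center on the far image edge still maps to the last cell.
    col = min(int(cx / width * grid_size), grid_size - 1)
    row = min(int(cy / height * grid_size), grid_size - 1)
    return row, col


def overlapped_cells(box, image_size, grid_size=5):
    """Return every (row, col) cell that the box extends into."""
    width, height = image_size
    first_col = min(int(box[0] / width * grid_size), grid_size - 1)
    last_col = min(int(box[2] / width * grid_size), grid_size - 1)
    first_row = min(int(box[1] / height * grid_size), grid_size - 1)
    last_row = min(int(box[3] / height * grid_size), grid_size - 1)
    return [(r, c)
            for r in range(first_row, last_row + 1)
            for c in range(first_col, last_col + 1)]


# Hypothetical object in a 500x500 image divided into a 5x5 grid.
box = (120, 150, 380, 290)
cell = responsible_cell(box, (500, 500))
cells = overlapped_cells(box, (500, 500))
print(cell, len(cells))
```

In this example the object extends into six cells, but its center falls in exactly one of them, and during training only that cell is assigned the object as its target.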

@DHAiRYA there are already several old threads covering this topic. Try the search.

Here is one example of a related thread:

@ai_curious thank you for the clarification. I removed the misleading comment.


Thank you for the clarification, it makes more sense now.
