How does YOLO know if 3 cells make 1 object?

The grid cells and anchor boxes in YOLO don’t cooperate at all. Each grid cell + anchor box location, called a ‘detector’ in the original paper, makes a set of class and location predictions based on its training. All of these predictions are made in parallel and completely independently of one another.

In post-CNN processing, possible duplicate predictions made by multiple ‘detectors’ are disambiguated and filtered so that only the highest-confidence prediction for each object is retained.
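This filtering step is commonly implemented as non-max suppression (NMS). Here is a minimal sketch in plain Python, assuming boxes are given as (x1, y1, x2, y2) corner coordinates. The function names, the sample boxes and the 0.5 IoU threshold are my own choices for illustration, not taken from any particular YOLO implementation.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes to keep, highest confidence first."""
    # Visit predictions from most to least confident.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not heavily overlap a box already kept.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Hypothetical output of four independent 'detectors': the first three
# all predicted the same object, the fourth predicted a different one.
boxes = [(10, 10, 110, 110), (12, 8, 108, 112), (20, 20, 120, 120), (200, 200, 260, 260)]
scores = [0.9, 0.75, 0.6, 0.8]
kept = non_max_suppression(boxes, scores)
print(kept)
```

Note that nothing gets merged or averaged here: the lower-confidence duplicates are simply dropped. In a multi-class setting this is typically run once per class.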

There is no information sharing across ‘detectors’ and no merging. If an object is spread across multiple grid cells, the center of the object is only in one of them, and that grid cell is the one that should make the prediction for the entire object, not only for the part of the object within its grid cell. If several grid cells each make a prediction (again, for the entire object), the predictions are ranked by confidence and the lower-quality ones are suppressed.
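To make the "center is only in one of them" point concrete, here is a small sketch that finds the one grid cell containing an object's center. The 608 x 608 image size, the 19 x 19 grid and the function name are assumptions for illustration only; substitute whatever your model uses.

```python
def responsible_cell(box, image_size, grid_size):
    """Return (row, col) of the grid cell that contains the box center.

    box is (x1, y1, x2, y2) in pixels, image_size is (width, height).
    """
    x1, y1, x2, y2 = box
    width, height = image_size
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    # Scale the center to grid units; clamp so a center lying exactly on
    # the right or bottom image edge still maps to the last cell.
    col = min(int(cx * grid_size / width), grid_size - 1)
    row = min(int(cy * grid_size / height), grid_size - 1)
    return row, col


# Hypothetical object whose box overlaps many 32 x 32 pixel cells.
cell = responsible_cell((100, 150, 260, 300), (608, 608), 19)
print(cell)
```

The box in this example overlaps dozens of cells, but only the single cell returned here is the one trained to predict it, and that prediction covers the full box.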

@DHAiRYA there are already several old threads covering this topic. Try the search.

Here is one example of a related thread: