YOLO: detecting the center of an object

I have just gone through the first video on YOLO.
My question is this:
we divide the image into a grid and get the output for each grid cell as a vector in the output volume.
If an object spans multiple grid cells, how does the convnet determine the center of the object?

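It learns to do that during training. As you observe, there is no requirement that an object be contained in a single cell: the object is assigned to the one cell that contains the center of its bounding box. The training labels are built that way, so the network learns to predict the object from that cell even when the object extends into neighboring cells.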
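To make the assignment concrete, here is a minimal sketch of how a training label could pick the responsible cell for one ground-truth box. The function name `responsible_cell`, the corner-style box format, and the 3x3 grid are assumptions for illustration only; they are not taken from the course code or from any particular YOLO implementation.

```python
def responsible_cell(box, image_size, grid_size):
    """Find the grid cell that contains the center of a bounding box.

    box:        (x_min, y_min, x_max, y_max) in pixels (assumed format)
    image_size: (width, height) in pixels
    grid_size:  number of cells along each side of the grid
    Returns (row, col, x_offset, y_offset), where the offsets give the
    position of the center inside the cell as fractions of the cell size.
    """
    x_min, y_min, x_max, y_max = box
    width, height = image_size

    # The center is the midpoint of the box, wherever the box edges fall.
    cx = (x_min + x_max) / 2.0
    cy = (y_min + y_max) / 2.0

    cell_w = width / grid_size
    cell_h = height / grid_size

    # Clamp so that a center on the right or bottom image edge
    # still maps to the last cell.
    col = min(int(cx // cell_w), grid_size - 1)
    row = min(int(cy // cell_h), grid_size - 1)

    x_offset = cx / cell_w - col
    y_offset = cy / cell_h - row
    return row, col, x_offset, y_offset


# A box covering parts of six cells on a 300x300 image with a 3x3 grid.
# Its center is (150, 200), so only the cell at row 2, col 1 is responsible.
print(responsible_cell((50, 120, 250, 280), (300, 300), 3))
```

Only the label vector of that one cell marks the object as present; the other cells the object overlaps are labeled as containing no object center, which is what the network is trained to reproduce.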
