YOLO is by far the most complicated system we’ve seen so far, so it’s no wonder that it takes some serious head-scratching to understand. The key point is that the algorithm is not limited to seeing things inside the current grid cell: the grid cells are just a way to organize the computation. A given object is reported only by the grid cell that contains its centroid, but there is no requirement that the object’s bounding box lie completely within that cell. The bounding box “is what it is”. Over the next couple of lectures and in the assignment, you’ll also see how they deal with the fact that the same object can be reported multiple times with slightly different bounding boxes. Prof Ng doesn’t really say much about how all of this complexity gets trained, but it’s a safe bet that “it’s complicated”.
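To make the “centroid picks the cell, but the box can be any size” idea concrete, here is a minimal sketch. It is not code from the course or the assignment. The helper name `assign_to_cell`, the 19x19 grid, and the input format (centroid and size normalized to [0, 1] relative to the whole image) are all assumptions for illustration. It loosely follows the lecture’s idea that the centroid offsets live within the cell while the width and height are measured in cell units and can exceed 1.

```python
from math import floor


def assign_to_cell(x_center, y_center, w, h, grid_size=19):
    """Hypothetical helper: find the grid cell that contains a box's centroid
    and express the box relative to that cell.

    Assumes x_center, y_center, w, h are normalized to [0, 1] over the image;
    grid_size=19 is an assumed default, not taken from the original post.
    """
    # The cell containing the centroid "owns" the object. Clamp so a centroid
    # sitting exactly on the far image edge (1.0) still maps to the last cell.
    col = min(floor(x_center * grid_size), grid_size - 1)
    row = min(floor(y_center * grid_size), grid_size - 1)

    # Centroid offset within that cell, in cell units (between 0 and 1).
    bx = x_center * grid_size - col
    by = y_center * grid_size - row

    # Width and height in cell units: nothing limits these to 1, so the
    # box can extend across many neighbouring cells.
    bw = w * grid_size
    bh = h * grid_size
    return (row, col), (bx, by, bw, bh)


if __name__ == "__main__":
    # A wide object centred in the image: owned by a single cell,
    # but its box spans roughly 11 cells horizontally.
    cell, box = assign_to_cell(0.5, 0.5, 0.6, 0.3)
    print(cell, box)
```

For that example the object is assigned to cell (9, 9) only, even though its width comes out to about 11.4 cell widths. That is exactly the sense in which the grid organizes the computation without restricting where the box can go.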
If you have more detailed questions about any of this and want to go deeper, there are some great threads from fellow student ai_curious, who has done some serious work using and studying YOLO and then writing about it. Here’s a good one to start on, and this one is more specific to the question of multiple bounding boxes.