Non-max suppression - cell smaller than object?

Object detection happens independently of the grid cells, and there is no requirement that an object fit inside a single grid cell. The grid cells are only used to organize the output: each detected object is attached to the grid cell that contains its centroid (the center of its bounding box). The network is trained to detect whole objects, and that is driven (as in all “supervised learning” cases) by how the input training data is labeled.
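To make the centroid rule concrete, here is a minimal sketch of how a labeled box could be mapped to a grid cell. The function names (`center_cell`, `cells_spanned`), the corner-format box, and the 608-pixel image with a 19 x 19 grid are all assumptions chosen for illustration, not the course's actual code.

```python
def center_cell(box, image_size=608, grid_size=19):
    """Return (row, col) of the grid cell containing the box center.

    box is (x_min, y_min, x_max, y_max) in pixels (assumed format).
    image_size and grid_size are illustrative values for a square image.
    """
    x_min, y_min, x_max, y_max = box
    cx = (x_min + x_max) / 2.0
    cy = (y_min + y_max) / 2.0
    cell = image_size / grid_size  # side length of one grid cell in pixels
    # Clamp so a center exactly on the right/bottom edge stays in the grid.
    col = min(int(cx // cell), grid_size - 1)
    row = min(int(cy // cell), grid_size - 1)
    return row, col


def cells_spanned(box, image_size=608, grid_size=19):
    """Return the box width and height measured in grid cells."""
    x_min, y_min, x_max, y_max = box
    cell = image_size / grid_size
    return (x_max - x_min) / cell, (y_max - y_min) / cell


# A box far larger than one 32-pixel cell still belongs to exactly one cell.
big_box = (100, 150, 400, 500)
print(center_cell(big_box))    # the single cell that "owns" this object
print(cells_spanned(big_box))  # width and height in units of cells
```

In this sketch the box covers roughly nine by eleven cells, yet only the cell holding its center is responsible for it. The predicted width and height are free to be much larger than the cell itself.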

There are a number of threads on the forums that go into quite a bit more depth on how YOLO works and is trained than we get in the lectures or the assignments. Here’s a good one to start with that discusses how the training works.