How does a cell detect a bounding box bigger than itself, YOLO?

This is correct, but it's worth seeing how that happens, as described in one of the replies above in this thread. At training time you know the ground-truth bounding box location, the training image dimensions, and the number of grid cells. From these, you can calculate which pixel is the ground-truth bounding box center and map it to one specific grid cell. That grid cell is then given a 1 for object presence, and every other grid cell a 0. Object presence and center location within the cell are components of the cost function, so the network 'learns' to mimic this manual assignment. At runtime, it simply makes predictions from the input signal and the learned parameters of the network.
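A minimal sketch of that training-time assignment (helper name, box format, and the 448x448 / 7x7 sizes are my assumptions for illustration, matching the original YOLO paper's defaults):

```python
def assign_grid_cell(box, img_w, img_h, grid_size=7):
    """box = (x_min, y_min, x_max, y_max) in pixels.
    Returns the (col, row) of the grid cell containing the box center;
    that cell gets object presence 1 at training time, all others 0."""
    cx = (box[0] + box[2]) / 2.0  # box center x in pixels
    cy = (box[1] + box[3]) / 2.0  # box center y in pixels
    # Scale the center into grid coordinates and floor to a cell index;
    # clamp so a center exactly on the right/bottom edge stays in range.
    col = min(int(cx / img_w * grid_size), grid_size - 1)
    row = min(int(cy / img_h * grid_size), grid_size - 1)
    return col, row

# A 448x448 image with a 7x7 grid: each cell covers 64x64 pixels.
# This box's center is at pixel (200, 300), which lands in cell (3, 4).
print(assign_grid_cell((100, 200, 300, 400), 448, 448))  # → (3, 4)
```

Note the box itself can be far larger than the 64x64 cell; only its center pixel decides which cell is responsible for predicting it.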

For objects in the middle of, and wholly contained by, the region corresponding to one grid cell, the center is predicted well. For objects that straddle grid cell regions and/or are bigger than a single cell, you may get multiple predictions for the same object. Non-max suppression comes into play to disambiguate them.
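For concreteness, here is a sketch of the greedy non-max suppression step (the function names and the 0.5 IoU threshold are illustrative choices, not specific to any particular YOLO implementation):

```python
def iou(a, b):
    """Intersection-over-union of two (x_min, y_min, x_max, y_max) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, discard remaining boxes
    that overlap it above the IoU threshold, then repeat on what's left."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

# Two nearly identical detections of one object, plus one distant object:
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # → [0, 2]: the duplicate at index 1 is suppressed
```

So when several grid cells each emit a box for the same large or straddling object, the lower-confidence duplicates get suppressed and one detection survives.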