Here are some thoughts related to your questions and assertions. Hope it helps.
YOLO doesn’t draw bounding boxes: it predicts bounding box center location and shape (width and height). Whether those predictions ever get visualized depends on the application. If the model is driving an autonomous vehicle, for example, likely not.
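If an application does want to see the boxes, that drawing happens downstream of the network as ordinary post-processing of its numeric outputs. A minimal sketch, assuming the prediction has already been decoded into pixel units (the helper name, arguments, and colors here are just illustrative, not part of any YOLO implementation):

```python
import cv2  # drawing happens outside the network, only if the application wants it

def draw_prediction(image, b_x, b_y, b_w, b_h, label):
    """Turn one predicted (center, shape) into pixels on a copy of the image.

    b_x, b_y, b_w, b_h are assumed to already be in pixel units here.
    """
    x1, y1 = int(b_x - b_w / 2), int(b_y - b_h / 2)
    x2, y2 = int(b_x + b_w / 2), int(b_y + b_h / 2)
    out = image.copy()
    cv2.rectangle(out, (x1, y1), (x2, y2), color=(0, 255, 0), thickness=2)
    cv2.putText(out, label, (x1, max(y1 - 5, 0)),
                cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
    return out
```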
It is common to hear or read that YOLO divides the input image into grid cells. This is not precisely correct. The image isn’t divided at all, which is where the name comes from… you only look at (input) the image once. The entire image is fed to the CNN and processed through forward propagation exactly once. What is divided into grid cells is the ground truth training data Y and the output predictions \hat{Y}.
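To make that concrete, here is a minimal sketch of where the grid actually lives. The grid size, anchor count, class count, and image resolution below are placeholders, not the values from any particular YOLO version:

```python
import numpy as np

# Illustrative sizes only: a 19x19 grid, 5 anchors per cell, 80 classes
S, B, C = 19, 5, 80

# The image itself is never sliced up; it is one tensor, forward-propagated once.
image = np.zeros((608, 608, 3))

# The grid shows up only in the shape of the labels Y and the predictions Y_hat:
# one (objectness, x, y, w, h, class scores) vector per anchor per grid cell.
Y_hat = np.zeros((S, S, B, 5 + C))

print(image.shape)   # (608, 608, 3)   -> no grid here
print(Y_hat.shape)   # (19, 19, 5, 85) -> the "grid" is just the leading two dims
```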
One needs to be careful making assertions about what YOLO does or how it works, because there are many versions out there and they don’t all work exactly the same way. I think v1 made a single class prediction shared by all the predicted bounding boxes in a given grid cell, whereas v2 makes a separate class prediction for each predicted bounding box in a grid cell; the shape comparison below illustrates the difference.
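Here is one way to see that difference, in terms of the output tensor layout. The grid size, box count, and class count are again placeholders rather than the real configurations of either version:

```python
# S = grid size, B = boxes per cell, C = number of classes (illustrative values)
S, B, C = 7, 2, 20

# v1-style: 5 numbers (x, y, w, h, confidence) per box, but the C class scores
# are predicted once per grid cell and shared by all B boxes in that cell.
v1_output_shape = (S, S, B * 5 + C)      # (7, 7, 30) with these numbers

# v2-style: every box (anchor) carries its own class scores, so each box
# contributes (5 + C) numbers of its own.
v2_output_shape = (S, S, B * (5 + C))    # (7, 7, 50) with these numbers
```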
It is correct that a predicted bounding box can be larger than a grid cell. There are threads already in the forum that describe the math of how and why; a short sketch is below. It is important to connect this idea with the fact that the entire image is the source of information for each bounding box and classification prediction, not just a grid-cell-shaped subregion of it. Also, again, it is object center location and shape that the YOLO neural net predicts; bounding boxes are not drawn by it.
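As a sketch of how a box can outgrow its cell, here is the v2-style box decoding, working in grid-cell units. The raw values and anchor sizes are made up, and the function name is just for illustration; the point is that the sigmoid confines the predicted center to its own cell, while the width and height are exponential scalings of the anchor dimensions and can span many cells:

```python
import numpy as np

def decode_box_v2_style(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """Decode one raw prediction into a box, in grid-cell units.

    (c_x, c_y) is the cell's top-left corner, (p_w, p_h) the anchor size.
    This follows the v2-style parameterization; treat it as illustrative.
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    b_x = c_x + sigmoid(t_x)      # center is confined to its own cell ...
    b_y = c_y + sigmoid(t_y)
    b_w = p_w * np.exp(t_w)       # ... but width/height are unconstrained
    b_h = p_h * np.exp(t_h)       #     multiples of the anchor dimensions
    return b_x, b_y, b_w, b_h

# With a 2-cell-wide anchor and t_w = 1.0, the predicted box spans ~5.4 cells:
print(decode_box_v2_style(0.0, 0.0, 1.0, 1.0, c_x=3, c_y=3, p_w=2.0, p_h=2.0))
```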
This related thread has more background. You can find others like it using Search.