The image isn’t actually divided. The entire image is fed into the neural net once, and the net makes a large number of predictions about it in a single forward pass.
It is the training data labels, not the image, that are structured to match the grid cell, anchor box, and class-count specifications.
- X, the input, is a single image: the entire thing.
- Y, the training data labels, is a multidimensional tensor whose shape is determined by the grid cell count, anchor box count, and number of classes.
- \hat{Y}, the network output (the predicted values), has the same shape as Y.
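As a minimal sketch of what that shape looks like, here is one possible configuration; the grid size (19×19), anchor count (5), and class count (80) are assumptions chosen for the example, not values from this thread:

```python
import numpy as np

# Assumed configuration (not specified in this thread):
GRID_H, GRID_W = 19, 19   # grid cells
NUM_ANCHORS = 5           # anchor boxes per cell
NUM_CLASSES = 80          # object classes

# One slot per (cell, anchor): p_c, b_x, b_y, b_w, b_h, then class scores.
Y = np.zeros((GRID_H, GRID_W, NUM_ANCHORS, 5 + NUM_CLASSES))
print(Y.shape)  # (19, 19, 5, 85)
```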
The work of setting up the training data for YOLO, then, is mapping each object's bounding box to the appropriate location in Y.
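Here is a hedged sketch of that mapping, using the same assumed configuration as above. Note that a real pipeline picks the anchor slot by IoU between the box and the anchor shapes; that choice is simplified to slot 0 here:

```python
import numpy as np

# Assumed configuration (same assumptions as the earlier sketch):
GRID_H, GRID_W, NUM_ANCHORS, NUM_CLASSES = 19, 19, 5, 80

def encode_boxes(boxes):
    """Map ground-truth boxes into the label tensor Y.

    `boxes` is a list of (x, y, w, h, class_id), with coordinates
    normalized to [0, 1] relative to the whole image. Anchor
    assignment is simplified here (always slot 0); a real pipeline
    selects the anchor with the best IoU against each box.
    """
    Y = np.zeros((GRID_H, GRID_W, NUM_ANCHORS, 5 + NUM_CLASSES),
                 dtype=np.float32)
    for x, y, w, h, class_id in boxes:
        # The grid cell containing the box center "owns" the object.
        row = min(int(y * GRID_H), GRID_H - 1)
        col = min(int(x * GRID_W), GRID_W - 1)
        a = 0                                  # simplified anchor choice
        Y[row, col, a, 0] = 1.0                # objectness p_c
        Y[row, col, a, 1:5] = [x, y, w, h]     # box geometry
        Y[row, col, a, 5 + class_id] = 1.0     # one-hot class score
    return Y

# One box: centered at (0.5, 0.5), 20% of image size, class 3.
Y = encode_boxes([(0.5, 0.5, 0.2, 0.2, 3)])
print(Y.shape)          # (19, 19, 5, 85)
print(Y[9, 9, 0, :6])   # [1.  0.5 0.5 0.2 0.2 0. ]
```

Every position in Y that does not correspond to an object stays zero, which is exactly what lets the network learn to predict p_c = 0 for empty cells.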
Try searching the forum for something like "YOLO training data" to find other related discussions. Here is one such…