Here’s my version:
First, it’s important to distinguish whether we are talking about establishing the training data, or about the end of a forward pass of the YOLO network.
If the former, the correct grid cell for the centroid of the object is easily determined mathematically, since the label must provide ground-truth bounding box coordinates and we know the image dimensions and the grid size. In the attached image, one object would be at i=0, j=2 and the other at i=3, j=1. As @paulinpaloalto mentions, these ground truth values then drive learning during training iterations as the network attempts to reproduce those outputs. In other words, you create a y of the same shape as the network output you need, assign it values known from the ground truth labels, and use a cost function that penalizes the difference y - \hat{y}.
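To make the "determined mathematically" part concrete, here is a minimal sketch in Python/NumPy. It assumes i is the row index and j is the column, a 5x4 grid as in the image, one box per cell (no anchors), and a per-cell label layout of [confidence, bx, by, bw, bh, class one-hot]; the function names, num_classes value, and exact layout are all hypothetical, not from the original post or the course:

```python
import numpy as np

def centroid_to_grid_cell(cx, cy, image_w, image_h, grid_rows=5, grid_cols=4):
    """Map a ground-truth box centroid (in pixels) to its grid cell (i, j)."""
    # Each cell covers image_h / grid_rows pixels vertically and
    # image_w / grid_cols pixels horizontally. Clamp so a centroid on the
    # bottom/right edge still falls in the last cell.
    i = min(int(cy / image_h * grid_rows), grid_rows - 1)  # row
    j = min(int(cx / image_w * grid_cols), grid_cols - 1)  # column
    return i, j

def build_target_tensor(labels, image_w, image_h,
                        grid_rows=5, grid_cols=4, num_classes=3):
    """Build the ground-truth y with one slot per grid cell:
    [confidence, bx, by, bw, bh, class one-hot...]."""
    y = np.zeros((grid_rows, grid_cols, 5 + num_classes))
    for (cx, cy, w, h, cls) in labels:  # pixel coords + integer class id
        i, j = centroid_to_grid_cell(cx, cy, image_w, image_h,
                                     grid_rows, grid_cols)
        y[i, j, 0] = 1.0  # an object centroid lives in this cell
        # Store box geometry normalized by image size (one common convention)
        y[i, j, 1:5] = [cx / image_w, cy / image_h, w / image_w, h / image_h]
        y[i, j, 5 + cls] = 1.0  # one-hot class
    return y
```

Every cell not touched by a label simply stays all zeros, which is exactly the "empty" target the network should learn to reproduce there.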
Notice, however, that at the end of a forward pass of the network there is no assignment of grid cell locations (nor of any other output, like bounding box shape or object class) to the predicted outputs \hat{y}. Rather, the network just outputs what it outputs. In this example, that would be 20 sets of predictions, since it is a 5x4 grid (ignoring anchor boxes, since they aren’t depicted here). Hopefully the training has gone well, so 18 of the predictions are empty or at least very low confidence, and two are non-empty and high confidence. Those two will be the network output locations corresponding to the two grid cells that were non-empty in the training data, i=0, j=2 and i=3, j=1.
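In code terms, "reading" the prediction is nothing more than scanning every cell of \hat{y} and thresholding on confidence; no cell is ever assigned. A minimal sketch, assuming \hat{y} is a NumPy array with the same (rows, cols, 5 + num_classes) shape as the y built above, and with a hypothetical threshold value:

```python
def extract_detections(y_hat, confidence_threshold=0.5):
    """Scan all grid cells of the raw network output and keep only the
    high-confidence ones."""
    detections = []
    grid_rows, grid_cols = y_hat.shape[:2]
    for i in range(grid_rows):
        for j in range(grid_cols):
            confidence = y_hat[i, j, 0]
            if confidence >= confidence_threshold:
                # (cell indices, confidence, predicted box geometry)
                detections.append((i, j, confidence, y_hat[i, j, 1:5]))
    return detections
```

If training went well on the example above, this loop would return exactly two entries, at (0, 2) and (3, 1).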
Training and predicting the anchor box, bounding box shape, object confidence and object class values all work exactly the same way, which is part of the elegance of the YOLO idea. Hope this helps.
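To illustrate the "works exactly the same way" point, here is a toy squared-error cost over the whole output tensor. The real YOLO loss weights the terms differently and is more careful about cells without objects, so treat this purely as a sketch of the idea that every component of y is trained by minimizing its difference from \hat{y}:

```python
import numpy as np

def yolo_style_loss(y, y_hat):
    """Toy loss: squared error on confidence everywhere, plus squared error
    on box geometry and class scores only in cells that contain an object."""
    obj_mask = y[..., 0:1]  # 1 where a cell contains an object centroid
    conf_loss = np.sum((y[..., 0] - y_hat[..., 0]) ** 2)
    box_loss = np.sum(obj_mask * (y[..., 1:5] - y_hat[..., 1:5]) ** 2)
    class_loss = np.sum(obj_mask * (y[..., 5:] - y_hat[..., 5:]) ** 2)
    return conf_loss + box_loss + class_loss
```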