Let’s say we have 9 grid cells in total and 1 object in the image, so we have 1 midpoint for it. One grid cell contains this midpoint’s coordinates, which means its ground truth label is something like [1 bx by bh bw 1 0 0], so every other grid cell’s ground truth label is [ 0 ? ? ? ? ? ? ? ].
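To make this concrete, here is roughly how I picture those labels in code (the 3 x 3 grid and the exact vector layout are just my assumption for the example):

```python
import numpy as np

# 3 x 3 grid -> 9 cells, one label vector per cell laid out as
# [p_obj, bx, by, bh, bw, c1, c2, c3] (illustrative layout only).
labels = np.zeros((3, 3, 8))   # every cell starts as "no object": p_obj = 0

# The one cell whose area contains the object's midpoint gets the full label.
# For the other 8 cells the remaining entries are "don't care" (the '?' above);
# the zeros here are just placeholders.
labels[1, 1] = [1, 0.4, 0.6, 0.5, 0.3, 1, 0, 0]
```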
I understand that each grid cell tries to predict some probability of having an object in it. That much is clear, and they can mistakenly report a high probability even when they contain no object, so we end up with multiple predictions for the same object across grid cells.
But since the other 8 grid cells’ ground truth is [ 0 ? ? ? ? ? ? ? ], how come they predict bounding box coordinates? They cannot affect the loss function in terms of bounding box coordinates (because we didn’t define any ground truth coordinates for them). How can they predict these bounding boxes?
YOLO is pretty deep waters. There are a bunch of threads from ai_curious which go into a lot of detail about how all this works. Here’s a good one to start with and then it links to some of the others.
The short answer to your question is that objects can span multiple grid cells, but they are only reported by the grid cell that contains their centroid. See the thread linked above for the real explanation.
To understand YOLO, I suggest separating what happens during initial training setup from what happens during training itself and later during operational use.
During training setup, for each labelled object in ground truth there is indeed exactly one detector location (grid cell plus anchor box) that has non-zero values. The image-relative center location and extent of an object’s bounding box are either provided directly in the labelled training data or are trivial to compute from its corners. Once you know the pixel coordinates of the center location, it is similarly trivial to assign it to one detector: the grid cell index derives from the center location, and the anchor box derives from the shape. So far so good, but now it starts to get more complicated.
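To make that assignment concrete, here is a rough sketch of how one labelled box could be mapped to a detector location. The grid size, anchor shapes, and the IoU-of-shapes rule for picking the anchor are assumptions for illustration, not the code of any particular YOLO implementation.

```python
import numpy as np

GRID = 3                      # 3 x 3 grid -> 9 cells, as in the question
ANCHORS = np.array([          # anchor (width, height), image-relative units
    [0.10, 0.15],
    [0.30, 0.40],
])

def assign_detector(box):
    """box = (cx, cy, w, h), all normalized to [0, 1] relative to the image."""
    cx, cy, w, h = box

    # Grid cell index derives from the center location.
    col = int(cx * GRID)
    row = int(cy * GRID)

    # Anchor index derives from shape: pick the anchor whose width/height
    # best matches the labelled box (IoU of the two shapes centered at origin).
    inter = np.minimum(ANCHORS[:, 0], w) * np.minimum(ANCHORS[:, 1], h)
    union = ANCHORS[:, 0] * ANCHORS[:, 1] + w * h - inter
    anchor = int(np.argmax(inter / union))

    return row, col, anchor

print(assign_detector((0.52, 0.48, 0.28, 0.35)))   # -> (1, 1, 1): center cell, larger anchor
```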
At the end of each forward propagation during training, every detector location makes predictions. The YOLO CNN has a multidimensional output shape, and every location represents a prediction. It can’t not make all these predictions: all the activations in the last layer must output values. In the loss function, all these values are ‘compared’ to their ground truth values. The YOLO loss function has many components (object presence, center location, shape, and class), but they all use sum of squares to compute error. Having tried training a YOLO CNN myself, I can assure you that at the beginning of training the predictions are terrible: crazy values everywhere. Eventually, if training were perfect, the predictions would exactly mirror ground truth and there would only be non-zero predictions where there are actual ground truth labels. However, training isn’t perfect, and it is extremely likely that some detector locations get it wrong: they make non-zero valued predictions in the absence of ground truth objects at that location.
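Here is a minimal sketch of how such a sum-of-squares loss can be masked so that box and class errors are only counted where ground truth defines them, which is why the ‘?’ entries in the question never affect the loss. The vector layout, grid size, and absence of the usual weighting constants are simplifications for illustration.

```python
import numpy as np

# Assumed layout per cell: [p_obj, cx, cy, w, h, c1, c2, c3] on a 3 x 3 grid
# with a single anchor, so pred and truth both have shape (3, 3, 8).

def yolo_like_loss(pred, truth):
    obj_mask = truth[..., 0] == 1          # cells that own a ground-truth center
    noobj_mask = ~obj_mask

    # Object-presence error is computed everywhere: cells with no object
    # are pushed toward predicting 0 confidence.
    obj_loss = np.sum((pred[obj_mask, 0] - 1.0) ** 2)
    noobj_loss = np.sum((pred[noobj_mask, 0] - 0.0) ** 2)

    # Box and class errors only where ground truth defines them.
    box_loss = np.sum((pred[obj_mask, 1:5] - truth[obj_mask, 1:5]) ** 2)
    cls_loss = np.sum((pred[obj_mask, 5:] - truth[obj_mask, 5:]) ** 2)

    return obj_loss + noobj_loss + box_loss + cls_loss
```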
This possibility of getting it wrong is thus also present at runtime, when there is no ground truth. Based on their training, each detector location reports what it ‘sees’. Not only is it possible that more than one detector location ‘sees’ an object; for any object with a center near a grid cell boundary it is actually likely. Two neighbors can each report that the center is theirs. Post-CNN processing in the form of confidence thresholds and non-max suppression is required to resolve the ‘duplicates’.
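A rough sketch of that post-processing step, with an assumed corner-format box representation and illustrative threshold values:

```python
import numpy as np

def iou(a, b):
    """a, b: boxes as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, conf_thresh=0.5, iou_thresh=0.5):
    """Drop low-confidence boxes, then suppress overlapping duplicates,
    keeping only the highest-scoring report for each object."""
    order = [i for i in np.argsort(scores)[::-1] if scores[i] >= conf_thresh]
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```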
Hope this helps.