I say that because we’re considering the case where there is only one object, but two grid cells ‘claim’ it, to use your word. The object’s center lies in only one of those cells, so any other cell that claims it is a mistake.

The key to understanding how a bounding box prediction can be larger than one grid cell is this diagram from the paper…

Bounding box width and height b are multiples of the anchor box width and height. Here p is used for the anchor box because the paper refers to anchor boxes as *priors*. The anchor box shape is multiplied by e^t, where t is the direct output of the network. e^t can be any positive number. If t > 0 then e^t > 1, so b will be larger than p, and it can even be larger than the grid cell size.
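To make the arithmetic concrete, here is a minimal sketch of that scaling in Python. The function name `decode_box_size`, the anchor size, and the choice to measure everything in grid-cell units are my own assumptions for illustration, not code from the paper or the course.

```python
import math

def decode_box_size(t_w, t_h, p_w, p_h):
    """Scale the anchor (prior) shape by e^t: b = p * exp(t).

    Hypothetical helper; sizes are assumed to be in grid-cell units.
    """
    return p_w * math.exp(t_w), p_h * math.exp(t_h)

# Assumed anchor that is smaller than one grid cell in both dimensions.
p_w, p_h = 0.8, 0.6

for t in (-1.0, 0.0, 1.0, 2.0):
    b_w, b_h = decode_box_size(t, t, p_w, p_h)
    print(f"t={t:+.1f}  b_w={b_w:.2f}  b_h={b_h:.2f}  (grid cells)")
```

With t = 0 the box has exactly the anchor shape, negative t shrinks it, and positive t grows it without any upper limit, which is why the predicted box is not confined to the grid cell that predicts it.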

Here’s another recent thread that should resonate…