This is just a simplified example to ask my question:
If I have a YOLO model that was trained on input pictures with a 3x3 grid, do I need to retrain the whole model if I now use input pictures with a 9x9 grid? Or can the YOLO model generalize to input pictures divided into a different number of grid cells?
Inputs to a YOLO CNN don’t know anything about grids. They are just images. The same image can be input to any YOLO CNN regardless of how it downsamples to produce its final output. What the number of grid cells impacts is the output shape, and thus the shape and number of trainable parameters in the hidden and final layers.
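To make the point concrete, here is a minimal sketch of how the grid size falls out of the network's downsampling rather than out of the image itself. It assumes a purely convolutional backbone whose stages each shrink the feature map by an integer stride; the function name `grid_size`, the 288-pixel input, and the stride lists are hypothetical values chosen for illustration, not taken from any particular YOLO version.

```python
import math


def grid_size(input_px: int, stage_strides: list[int]) -> int:
    """Side length S of the output grid for a square input image.

    Assumption: every downsampling stage divides the spatial size by an
    integer stride, so the total stride is the product of all stages.
    """
    total_stride = math.prod(stage_strides)
    if input_px % total_stride != 0:
        raise ValueError("input size must be divisible by the total stride")
    return input_px // total_stride


# The same 288x288 image fed to two hypothetical architectures:
coarse = grid_size(288, [2, 2, 2, 2, 2, 3])  # total stride 96
fine = grid_size(288, [2, 2, 2, 2, 2])       # total stride 32
print(coarse, fine)  # 3 9
```

The image is identical in both calls; only the amount of downsampling, and therefore the architecture, differs.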
YOLO CNN output contains S*S*B*(1+4+C) floating point numbers, where S is the number of grid cells per side, B is the number of boxes predicted per cell, and C is the number of classes. Switching S from 3 to 9 increases the number of outputs by a factor of 9. That requires an output layer of the corresponding shape, with parameters that support the required computation. It might be possible to take a network trained for one output shape and use transfer learning to train only a final layer of the new output shape, but my intuition is that this would only be feasible going from more grid cells to fewer. If you have already downsampled enough to get to S=3, I think you have lost information necessary to support S=9. It is probably a better strategy to adjust the entire network architecture to arrive at the preferred shape and retrain from scratch.
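The arithmetic can be checked with a short sketch. It uses the S*S*B*(1+4+C) formula from above; the values B=2 and C=20, the 1024-feature input, and the assumption that the final layer is fully connected are all hypothetical choices for illustration.

```python
def yolo_output_count(S: int, B: int, C: int) -> int:
    """Number of output floats: S*S cells, B boxes per cell,
    each box carrying 1 objectness score, 4 coordinates and C class scores."""
    return S * S * B * (1 + 4 + C)


def dense_layer_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of a fully connected layer (assumed final layer)."""
    return in_features * out_features + out_features


B, C = 2, 20          # hypothetical: 2 boxes per cell, 20 classes
in_features = 1024    # hypothetical size of the last hidden layer

out_3 = yolo_output_count(3, B, C)
out_9 = yolo_output_count(9, B, C)
print(out_3, out_9, out_9 // out_3)  # 450 4050 9

# The final layer's parameter count grows by the same factor,
# so weights trained for S=3 cannot simply be reused for S=9.
print(dense_layer_params(in_features, out_3))
print(dense_layer_params(in_features, out_9))
```

Under these assumptions the S=9 output layer has nine times as many outputs and parameters as the S=3 one, which is why at least that layer has to be trained anew.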