@ai_curious Thanks! This gives me a fair bit of clarity. I even looked up the model architecture online, and it makes sense. What threw me off was the course's emphasis on each grid cell performing object localization, which made me imagine something completely different.
So does this mean the grid structure matters only for training and for us? That is, we assign each object's center to a grid cell and define the bounding boxes so the YOLO architecture can learn them, but the model itself is not dividing the image into an S by S grid; the convolutions just reduce the image to an S by S output?
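
To check my understanding, here is a rough sketch of the two pieces as I picture them. The function names, the 448-pixel input, and the six halving steps are my own assumptions for illustration (a simplification, not the actual network definition): one function shows the spatial size shrinking through stride-2 layers, the other shows the labeling step that assigns an object's center to a cell.

```python
def output_grid_size(input_size, num_halvings):
    """Spatial size after a stack of stride-2 layers (each halves the size).

    Hypothetical simplification: real networks mix convolutions and pooling.
    """
    size = input_size
    for _ in range(num_halvings):
        size //= 2
    return size


def responsible_cell(cx, cy, image_size, s):
    """Label-side step: which (row, col) cell contains the box center.

    cx, cy are the center coordinates in pixels; s is the grid size.
    """
    cell = image_size / s
    # Clamp so a center exactly on the right/bottom edge stays in the grid.
    col = min(int(cx // cell), s - 1)
    row = min(int(cy // cell), s - 1)
    return row, col


# Assumed numbers: a 448x448 input and six halvings give a 7x7 output.
S = output_grid_size(448, 6)
row, col = responsible_cell(cx=300.0, cy=100.0, image_size=448, s=S)
print(S, row, col)
```

If this is right, the grid only shows up explicitly in `responsible_cell`, which is used when building the training targets; the network just produces an S by S output whose positions line up with those cells.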