How does the YOLO algorithm ensure that its architecture divides the image exactly into grid cells?

By definition, a grid means that the local spatial features of the cells do not overlap one another.

YOLO, however, uses a convolutional implementation. My question is: “How can YOLO guarantee that each unit in the output layer represents non-overlapping local spatial features?”

Here’s an illustration of what I would expect the architecture to look like in order to claim that it implements grid cells with convolutions:


(Suppose the first conv. layer consists of 8 filters.)

I would expect the first convolution layer to need a square filter whose size equals the stride, in order to guarantee that the regions do not overlap.

If that condition is not satisfied, how can YOLO be sure that the local spatial features represented by each output unit do not overlap?

I’m not sure I understand your points, but I think you are basically “over assuming” here. Where does it say that spatial features can’t overlap? What if the picture includes a pedestrian who happens to be standing in front of a car or a truck? YOLO (or any other Object Identification and Localization algorithm) needs to be able to handle that and identify both “objects” (the pedestrian and the vehicle), right?

I think you should listen to all the YOLO lectures before you form your conclusions. In other words, “hold that thought” and listen to everything Prof Ng has to say in this Object Detection section. I hope it will then become clearer, or at least that you will be able to compose a clearer question.

A couple of thoughts to carry with you on your YOLO exploration journey….

There is no guarantee that objects won’t straddle grid cell boundaries, or even that the objects are smaller than a single grid cell. Unlike sliding windows, YOLO handles situations like this by design.
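To make the "by design" part concrete: in the original YOLO formulation, an object is assigned to the grid cell that contains the center of its bounding box, no matter how far the box extends into neighboring cells. Here is a minimal sketch of that assignment. The function name `responsible_cell` and the 448x448 image with a 7x7 grid are just illustrative choices on my part, not code from any YOLO implementation.

```python
def responsible_cell(cx, cy, img_w, img_h, grid_size):
    """Return (row, col) of the grid cell containing the box center (cx, cy).

    cx, cy are in pixels. The box itself may be larger than a cell or
    straddle several cells; only its center decides the assignment.
    """
    col = int(cx * grid_size / img_w)
    row = int(cy * grid_size / img_h)
    # Clamp so a center lying exactly on the right/bottom edge stays in range.
    col = min(col, grid_size - 1)
    row = min(row, grid_size - 1)
    return row, col


# Hypothetical example: a box centered at (300, 100) in a 448x448 image
# with a 7x7 grid (each cell covers 64x64 pixels).
cell = responsible_cell(300, 100, 448, 448, 7)
print(cell)
```

So the grid is a rule for labeling and reading the output, not a claim that each output unit only "sees" its own cell.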

There is a correlation between the convolution sizes, the input image size and the grid cell size, but it’s not exactly the one you diagram. The ratio of input image size to grid cell size drives the number of output predictions that are made, but you are free to pick a different number and shape of convolution filters in the hidden layers so long as the last layer produces the desired number of outputs. The convolutions in the hidden layers are not generally the shape of the grid cells at all. I have pasted images of the first three (Redmon et al.) YOLO architectures below…
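If it helps, here is a small sketch of the arithmetic behind that point. It tracks three things through a stack of conv/pool layers: the output size, the distance in input pixels between adjacent output units (the effective stride), and the receptive field of one output unit. The layer list below is a made-up toy stack chosen so that a 448x448 input ends up as a 7x7 output; it is not the actual YOLO architecture, and the function name is my own.

```python
def grid_and_receptive_field(input_size, layers):
    """layers: list of (kernel, stride, padding) tuples, applied in order.

    Returns (output_size, effective_stride, receptive_field), all measured
    along one spatial dimension.
    """
    size = input_size
    jump = 1  # input pixels between adjacent units at the current layer
    rf = 1    # input pixels seen by one unit at the current layer
    for kernel, stride, padding in layers:
        size = (size + 2 * padding - kernel) // stride + 1
        rf += (kernel - 1) * jump
        jump *= stride
    return size, jump, rf


# Toy stack (an assumption for illustration): a 7x7 stride-2 conv, then
# alternating 2x2 stride-2 pooling and 3x3 stride-1 convs.
toy_layers = [
    (7, 2, 3), (2, 2, 0),
    (3, 1, 1), (2, 2, 0),
    (3, 1, 1), (2, 2, 0),
    (3, 1, 1), (2, 2, 0),
    (3, 1, 1), (2, 2, 0),
]
print(grid_and_receptive_field(448, toy_layers))

# The design from the question: one layer with filter size equal to stride.
print(grid_and_receptive_field(448, [(64, 64, 0)]))
```

Both stacks give a 7x7 output with an effective stride of 64 pixels, which matches 448 / 7. In the toy stack, though, each output unit sees 189 input pixels, so neighboring units overlap heavily. Only the "filter size equals stride" design has a receptive field of exactly 64. The grid size comes from the overall downsampling ratio, not from non-overlapping filters.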

[Image: YOLO V1 architecture diagram]