How can the YOLO algorithm be sure that its architecture divides the image exactly into grid cells?

wallik2 · January 29, 2023, 3:31am

By definition, a grid means that the local spatial features do not overlap one another.

YOLO also uses a convolutional implementation. So my question is: “How can YOLO guarantee that each unit in the output layer represents non-overlapping local spatial features?”

Here’s an illustration of what I would expect the architecture to look like in order to claim that it implements grid cells with a convolutional implementation.

(Suppose the first conv. layer consists of 8 filters.)

I would expect the first convolution layer to have a square filter whose size equals the stride, to guarantee no overlap.

If that condition is not satisfied, how can YOLO be sure that the local spatial features represented by each output unit do not overlap?
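To make the expectation above concrete, here is a minimal 1-D sketch (not taken from any YOLO code) that lists which input positions each output unit of a single unpadded convolution sees. The helper names `input_windows` and `windows_overlap` and the sizes used are hypothetical, chosen only for illustration. It shows that windows tile the input without overlap when the filter size equals the stride, and overlap when the filter is larger than the stride.

```python
def input_windows(input_size, kernel, stride):
    """Return the (start, end) input range seen by each output unit of a
    1-D convolution with no padding. The end index is exclusive."""
    n_out = (input_size - kernel) // stride + 1
    return [(i * stride, i * stride + kernel) for i in range(n_out)]


def windows_overlap(windows):
    """True if any window starts before the previous window has ended."""
    return any(prev_end > start
               for (_, prev_end), (start, _) in zip(windows, windows[1:]))


# Filter size equal to stride: the windows tile the input like a grid.
tiled = input_windows(8, kernel=2, stride=2)

# Filter size larger than stride: neighbouring windows share input positions.
overlapping = input_windows(8, kernel=3, stride=1)

print(tiled, windows_overlap(tiled))
print(overlapping, windows_overlap(overlapping))
```

This only describes one layer in isolation; it says nothing yet about whether YOLO actually needs this property.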

paulinpaloalto · January 29, 2023, 3:55am

I’m not sure I understand your points, but I think you are basically “over assuming” here. Where does it say that spatial features can’t overlap? What if the picture includes a pedestrian who happens to be standing in front of a car or a truck? YOLO (or any other Object Identification and Localization algorithm) needs to be able to handle that and identify both “objects” (the pedestrian and the vehicle), right?

I think you should listen to all the YOLO lectures before you form your conclusions. In other words, “hold that thought” and listen to everything Prof Ng has to say in this Object Detection section. I hope it will then become clearer, or at least that you will be able to compose a clearer question.

ai_curious · January 29, 2023, 12:59pm

A couple of thoughts to carry with you on your YOLO exploration journey….

There is no guarantee that objects won’t straddle grid cell boundaries, or even that the objects are smaller than a single grid cell. Unlike sliding windows, YOLO handles situations like this by design.

There is a relationship between the convolution sizes, the input image size and the grid cell size, but it’s not exactly the one you diagram. The ratio of input image size to grid cell size drives the number of output predictions that are made, but you are free to pick a different number and shape of convolution filters in the hidden layers so long as the last layer produces the desired number of outputs. The convolutions in the hidden layers are generally not the shape of the grid cells at all. I have pasted images of the first three (Redmon et al.) YOLO architectures below…

V1
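As a rough sketch of the point about image size, grid size and filter shapes, the snippet below tracks three numbers through a stack of layers: the output size, the cumulative stride (how many input pixels separate neighbouring output units) and the receptive field (how many input pixels one output unit can see). The layer list is a made-up toy stack, not any published YOLO configuration, and `grid_and_receptive_field` is a hypothetical helper name.

```python
def grid_and_receptive_field(input_size, layers):
    """Walk a stack of square conv/pool layers along one spatial axis.

    layers is a list of (kernel, stride, padding) tuples.
    Returns (output_size, cumulative_stride, receptive_field).
    """
    size, jump, rf = input_size, 1, 1
    for kernel, stride, padding in layers:
        size = (size + 2 * padding - kernel) // stride + 1
        rf += (kernel - 1) * jump   # each layer widens the view by (k-1) steps
        jump *= stride              # strides multiply through the stack
    return size, jump, rf


# Toy stack (an assumption for illustration): 3x3 conv, 2x2 pool, 3x3 conv, 2x2 pool.
toy_layers = [(3, 1, 1), (2, 2, 0), (3, 1, 1), (2, 2, 0)]
result = grid_and_receptive_field(16, toy_layers)
print(result)  # (4, 4, 10)
```

With a 16-pixel input this toy stack yields a 4-unit output, so each output unit corresponds to a 4-pixel grid cell, yet its receptive field is 10 pixels wide. The grid size comes from the overall downsampling ratio, while the filters in the hidden layers are free to overlap and to see beyond the cell they are assigned to.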
