Hi All
I have a question regarding how YOLO actually implements its grid (the 3 by 3 grid in the lecture video).
From this picture, we can see that if we have a ConvNet (with all fully connected layers rewritten as convolutional layers) trained on 14 by 14 images and apply it to 16 by 16 images, we get a 2 by 2 output. By the same logic, applying that ConvNet to 18 by 18 images gives a 3 by 3 output.
That is to say, the picture above corresponds to a sliding window with a stride of 2.
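To make the arithmetic concrete, here is a quick PyTorch sketch of a toy network in this style (my own reconstruction of a 14 by 14 net with the FC layers rewritten as convolutions, not the actual lecture architecture):

```python
# Toy "FC layers rewritten as convolutions" network trained on 14x14 inputs
# (my own sketch for checking spatial sizes, not the actual lecture network).
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5),     # 14x14 -> 10x10
    nn.MaxPool2d(2),                     # 10x10 -> 5x5
    nn.Conv2d(16, 400, kernel_size=5),   # former FC layer: 5x5 -> 1x1
    nn.Conv2d(400, 400, kernel_size=1),  # former FC layer
    nn.Conv2d(400, 4, kernel_size=1),    # final predictions
)

for size in (14, 16, 18):
    out = net(torch.zeros(1, 3, size, size))
    print(size, tuple(out.shape[-2:]))   # 14 -> (1, 1), 16 -> (2, 2), 18 -> (3, 3)
```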
But the claim in the “bounding box prediction” lecture video seems to be that if you input a 3 by 3 grid of 14 by 14 images (42 by 42 in total), you should get a 3 by 3 output. This is different from the calculation above.
So my question is: if I apply a ConvNet trained on a small window size to larger images, how is the output size determined? And how should we convolutionally implement a grid if we truly want to start from 42 by 42 and arrive at 3 by 3?
Update: after playing with some numbers, it seems that the stride of the sliding window is determined by the number of pooling layers. Here, the pooling layers are 2 by 2. If there is one such pooling layer, the stride of the sliding window is 2; if there are two such pooling layers, the stride is 4.
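A small sketch to check this (it assumes valid convolutions/pooling with no padding; the two layer stacks below are made up purely for illustration):

```python
# Spatial output size after applying (kernel, stride) layers in order,
# assuming "valid" convolutions/pooling (no padding).
def out_size(n, layers):
    for k, s in layers:
        n = (n - k) // s + 1
    return n

one_pool  = [(5, 1), (2, 2), (5, 1)]                  # conv5 -> pool2 -> conv5; 14 -> 1
two_pools = [(3, 1), (2, 2), (3, 1), (2, 2), (2, 1)]  # conv3 -> pool2 -> conv3 -> pool2 -> conv2; 14 -> 1

for n in (14, 16, 18, 22):
    print(n, out_size(n, one_pool), out_size(n, two_pools))
# one pool : 14 -> 1, 16 -> 2, 18 -> 3, 22 -> 5   (window slides with stride 2)
# two pools: 14 -> 1, 16 -> 1, 18 -> 2, 22 -> 3   (window slides with stride 4)
```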
Hi @O_Sub_Kwon
The output size is determined by how the convolutional and pooling layers transform the input's spatial dimensions. Convolutional layers preserve the spatial layout, while pooling layers shrink it by a factor given by their stride. In your example, the 2x2 pooling layers set the stride of the implicit sliding window: the effect is cumulative, so two such layers give a larger stride (and a smaller output) than one. To turn a 42x42 input into a 3x3 grid, you would need to choose the pooling strategy, padding, strides, and kernel sizes so that the network downsamples by the right overall factor.
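To make that concrete, here is a rough sketch of the bookkeeping (my own illustration, not course code): the per-layer size formula composes across the network, and the effective stride of the sliding window is the product of the per-layer strides.

```python
from math import prod

# Per-layer size formula: out = floor((in + 2*pad - kernel) / stride) + 1
def net_out(n, layers):                      # layers: (kernel, stride, pad) tuples
    for k, s, p in layers:
        n = (n + 2 * p - k) // s + 1
    return n

# Lecture-style stack trained on 14x14: conv5 -> pool2 -> conv5 -> 1x1 convs
lecture = [(5, 1, 0), (2, 2, 0), (5, 1, 0), (1, 1, 0), (1, 1, 0)]
print(net_out(42, lecture))                  # 15: a 15x15 map, i.e. stride-2 windows
print(prod(s for _, s, _ in lecture))        # 2: effective stride = product of strides

# A true 3x3 grid over 42x42 with 14-pixel windows needs the strides to
# multiply to 14, i.e. non-overlapping windows: (42 - 14) / 14 + 1 = 3.
# Toy example using only pooling (receptive field 14, effective stride 14):
toy = [(2, 2, 0), (7, 7, 0)]
print(net_out(14, toy), net_out(42, toy))    # 1 3
```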
Hope this helps! Feel free to ask if you need further assistance.