I have a question about the convolutional implementation of sliding windows.
I understand the point the lecture makes: by converting the FC layers to convolutional layers, the output becomes a volume that summarizes the prediction for each window instead of a single value, so I don't have to propagate each sliding window sequentially, and the computation over overlapping regions is shared, which reduces the computational cost.
But I’m not sure how it works in practice.
I'm wondering how it's possible for the sliding windows to be propagated simultaneously rather than sequentially.
More specifically, how does the ConvNet partition the image into windows and propagate them simultaneously when the training and test images have different input sizes, given that only the FC layers have been converted to convolutional layers?
For example, if I build an architecture where only the FC layers are converted to convolutional layers, will the network automatically work with windows the same size as the training input images and a stride of 2, as in the video?
If so, how would it partition the image into windows when the image is the same size as the window?
Or do I need to add a new layer that performs the windowing when the network receives the input?
Perhaps my question isn’t clear enough.
So my question is: does partitioning the image into windows and propagating them simultaneously require any special handling at the input layer?
Or is that just how a ConvNet's convolution and pooling layers work, so no change is needed at the input, and I only need to convert the FC layers to get the desired output shape?
The point is that we do not need to implement sliding windows directly. Convolutions are better in terms of both efficiency and flexibility. No change is needed at the input.
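As a minimal sketch of what that looks like (my own toy Keras example using the layer sizes I remember from the video, not the course notebook): the "FC" part is expressed as 5x5 and 1x1 convolutions, the input layer simply leaves the spatial size unspecified, and the same network then produces a 1 x 1 x 4 map for a 14 x 14 x 3 image and a 2 x 2 x 4 map for a 16 x 16 x 3 image.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Spatial size left as None: nothing special is done at the input.
inputs = tf.keras.Input(shape=(None, None, 3))
x = layers.Conv2D(16, 5, activation="relu")(inputs)     # 14x14x3 -> 10x10x16
x = layers.MaxPooling2D(2)(x)                           # -> 5x5x16
x = layers.Conv2D(400, 5, activation="relu")(x)         # former FC(400), now a 5x5 conv -> 1x1x400
x = layers.Conv2D(400, 1, activation="relu")(x)         # former FC(400), now a 1x1 conv
outputs = layers.Conv2D(4, 1, activation="softmax")(x)  # former softmax layer -> 1x1x4
model = tf.keras.Model(inputs, outputs)

print(model(tf.zeros((1, 14, 14, 3))).shape)  # (1, 1, 1, 4): a single "window"
print(model(tf.zeros((1, 16, 16, 3))).shape)  # (1, 2, 2, 4): a 2x2 grid of windows
```

Each position of the 2 x 2 map plays the role of one sliding-window position; the effective stride of 2 comes from the 2 x 2 max pool inside the network, not from anything you add at the input.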
Also note the point from Tom's question: all the architectures we deal with here require that the size and type of all images (training, validation, test, or images seen when the model is actually deployed) be the same.
In my mind, sliding windows are presented as an 'introductory concept'. Convolutions, in contrast, are able to look over everything in parallel, all at once. Personally, it is just a guess, but I think that is why they called it 'YOLO'.
First, let me make sure I’m understanding the point correctly.
If I convert the FC layers to convolutional layers, the convolutions themselves will give the same output as if I had implemented sliding windows directly.
Is that correct?
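To be concrete, this is roughly what I picture that equivalence to mean (a toy check I made up, with random weights and layer sizes like the sketch above; nothing here comes from the course notebook): the 2 x 2 x 4 map from a 16 x 16 x 3 image should match running the same network separately on each 14 x 14 crop taken with a stride of 2.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

net = tf.keras.Sequential([
    tf.keras.Input(shape=(None, None, 3)),
    layers.Conv2D(16, 5, activation="relu"),
    layers.MaxPooling2D(2),
    layers.Conv2D(400, 5, activation="relu"),   # former FC layer as a 5x5 conv
    layers.Conv2D(4, 1, activation="softmax"),  # former softmax layer as a 1x1 conv
])

img = tf.random.uniform((1, 16, 16, 3))
full = net(img)                                   # (1, 2, 2, 4): all four windows at once
for i in range(2):
    for j in range(2):
        crop = img[:, 2*i:2*i + 14, 2*j:2*j + 14, :]  # one 14x14 window, stride 2
        single = net(crop)                            # (1, 1, 1, 4): that window on its own
        # Expected to print True for each window if the equivalence holds.
        print(np.allclose(single[0, 0, 0], full[0, i, j], atol=1e-5))
```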
If that's correct, my question is: for the same 14 x 14 x 3 test image, does the architecture automatically add a yellow border around the image, as in the video, making it 16 x 16 x 3 as if it were padded?
(Because if it doesn't, and the original 14 x 14 x 3 image is propagated, then the output will be 1 x 1 x 4 instead of the 2 x 2 x 4 I want.)
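(Here is the size arithmetic I'm doing, assuming "valid" layers with the 5x5 convolutions and the 2x2 max pool from the video; the little helper is just my own illustration, and the 1x1 convolutions are omitted since they don't change the spatial size.)

```python
def out_size(n, f, stride=1):
    # "valid" conv/pool: nothing is added around the input
    return (n - f) // stride + 1

for n in (14, 16):
    x = out_size(n, 5)      # 5x5 conv:       14 -> 10,  16 -> 12
    x = out_size(x, 2, 2)   # 2x2 max pool:   10 -> 5,   12 -> 6
    x = out_size(x, 5)      # 5x5 "FC" conv:   5 -> 1,    6 -> 2
    print(f"{n} x {n} x 3 input -> {x} x {x} x 4 output")
```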
So why does the way it works change, even though the input layer hasn’t changed?