Hey @ai_curious,
Just to make sure that we are on the same page: Sliding Windows uses a dynamic window size at inference time. It starts with smaller window sizes and gradually moves towards larger ones. Allow me to quote an excerpt from the lecture "Object Detection". I have trimmed this excerpt wherever I felt it was redundant for our discussion:
If you have a test image like this, what you do is you start by picking a certain window size. And then you would input into this ConvNet a small rectangular region. So, take just this red square, input that into the ConvNet, and have the ConvNet make a prediction. And presumably for that little region in the red square, it'll say, no, that little red square does not contain a car. In the Sliding Windows Detection Algorithm, what you do is you then pass as input a second image, now bounded by this red square shifted a little bit over, and feed that to the ConvNet. So, you're feeding just the region of the image in the red square to the ConvNet and running the ConvNet again. And then you do that with a third image and so on. And you keep going until you've slid the window across every position in the image. But the idea is you basically go through every region of this size, and pass lots of little cropped images into the ConvNet and have it classify zero or one for each position at some stride. Now, having done this once (running this is called sliding the window through the image), you then repeat it, but now use a larger window. So, now you take a slightly larger region, resize this region into whatever input size the ConvNet is expecting, and feed that to the ConvNet and have it output zero or one. And then slide the window over again using some stride and so on. And you run that throughout your entire image until you get to the end. And then you might do it a third time using even larger windows and so on.
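The loop the lecture describes can be sketched like this. Everything here is illustrative: the `classifier` is a stand-in for the ConvNet, and in practice each crop would first be resized to the network's fixed input size.

```python
import numpy as np

def sliding_window_detect(image, classifier, window_sizes=(32, 64, 128), stride=16):
    """Run a binary classifier over crops at several window sizes.

    Returns a list of (row, col, size) for every window the classifier
    flags as containing the object. Repeating the outer loop with larger
    sizes is the "now use a larger window" step from the lecture.
    """
    h, w = image.shape[:2]
    detections = []
    for size in window_sizes:                  # scan again with larger windows
        for row in range(0, h - size + 1, stride):
            for col in range(0, w - size + 1, stride):
                crop = image[row:row + size, col:col + size]
                if classifier(crop) == 1:      # "ConvNet" says object present
                    detections.append((row, col, size))
    return detections

# Toy usage: "detect" a bright square in a synthetic image with a
# trivial brightness classifier standing in for the ConvNet.
img = np.zeros((128, 128))
img[40:80, 40:80] = 1.0
bright = lambda crop: int(crop.mean() > 0.5)
hits = sliding_window_detect(img, bright)
```

Only the 32-pixel windows fire here, since the larger windows never overlap the bright square enough to push the crop's mean above the threshold.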
So, the sliding windows algorithm tries windows as large as the entire image, so there is really no question of "it dealing with objects larger than its window". Now, the major disadvantage, even with the convolutional implementation of Sliding Windows, is that Sliding Windows performs object detection but doesn't predict the bounding box in terms of its explicit position. The inference procedure can tell us which grid cell the object lies in and with what window size the algorithm detected it; we can scale these back to the original image and treat that window as a crude bounding-box prediction, but that's it.
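That "scale it back" step is trivial to sketch. Assuming the convolutional implementation effectively slides the window by `stride` pixels per output-grid cell (the names below are illustrative, not from the course code), the crude box is just the window itself:

```python
def cell_to_box(cell_row, cell_col, window, stride):
    """Map a hit in the conv-implementation's output grid back to image
    coordinates. The window itself is the crude bounding box; there is
    no finer localisation than the grid position and window size.
    """
    top, left = cell_row * stride, cell_col * stride
    return (left, top, left + window, top + window)  # (x1, y1, x2, y2)

# e.g. a hit at output-grid cell (2, 3) with a 64-pixel window, stride 16
box = cell_to_box(2, 3, window=64, stride=16)  # -> (48, 32, 112, 96)
```

Notice the box coordinates are quantised to multiples of the stride and its size is always one of the tried window sizes, which is exactly why the prediction is crude.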
On the other hand, in the case of YOLO, the bounding boxes are predicted in terms of their explicit positioning, for which we also need a dataset with a different kind of labelling, one in which the bounding box coordinates are part of the labels. I hope this makes sense.
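To make that labelling difference concrete, here is one common convention for building such a label (a sketch; the course's exact normalisation and label layout may differ): the grid cell containing the box centre gets a vector `[pc, bx, by, bh, bw]`, with `bx, by` relative to that cell and `bh, bw` in units of the cell size.

```python
def yolo_label(box, grid=3, img_size=96):
    """Encode one ground-truth box (x1, y1, x2, y2) as a YOLO-style label.

    Returns (row, col, [pc, bx, by, bh, bw]): the responsible grid cell
    plus the label vector. bh/bw can exceed 1 when the box is larger
    than a cell. Class probabilities are omitted for brevity.
    """
    cell = img_size / grid
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2  # box centre
    col, row = int(cx // cell), int(cy // cell)            # responsible cell
    bx, by = (cx % cell) / cell, (cy % cell) / cell        # centre within cell
    bw = (box[2] - box[0]) / cell                          # width in cell units
    bh = (box[3] - box[1]) / cell                          # height in cell units
    return row, col, [1.0, bx, by, bh, bw]

# A 30x40 box centred at (45, 50) lands in the middle cell of a 3x3 grid.
row, col, label = yolo_label((30, 30, 60, 70))
```

Unlike the sliding-windows output, `bx, by, bh, bw` are continuous values, so the network can be trained to regress precise box positions rather than merely flagging a window.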
Cheers,
Elemento