Object Detection with Sliding windows algorithm

I have 2 main questions about this video: C4W3L03 Object Detection

Suppose I run prediction on a picture with many cars on the road, the algorithm runs with 2 window sizes, and both detect the same truck, like the following

In this case, what should the bounding box for this truck be? Do we have to merge them, intersect them, or just show both bounding boxes?

Thank you in advance

Hey @wallik2,
If you continue with this week a little further, you will find that Prof Andrew has created a dedicated lecture to answer this very question, entitled “Non-max Suppression”. Let me quote an excerpt from the same lecture for your reference:

One of the problems of Object Detection as you’ve learned about this so far, is that your algorithm may find multiple detections of the same objects. Rather than detecting an object just once, it might detect it multiple times. Non-max suppression is a way for you to make sure that your algorithm detects each object only once.

Feel free to skip the intermediate lectures and watch that one, but I suggest you follow them sequentially, since Prof Andrew discusses some prerequisites in the intermediate videos. I hope this helps.

Cheers,
Elemento

I’m unclear about this reply. I know how YOLO deals with objects larger than a grid cell, but how does sliding windows deal with objects larger than its window? I thought that limitation was one of the reasons YOLO was invented in the first place?

Hey @ai_curious,

Just to make sure that we are on the same page: Sliding Windows uses several window sizes at inference time. It starts with smaller window sizes and gradually moves to larger ones. Allow me to quote an excerpt from the lecture “Object Detection”. I have trimmed this excerpt wherever I felt it was redundant for our discussion:

If you have a test image like this what you do is you start by picking a certain window size. And then you would input into this ConvNet a small rectangular region. So, take just this below red square, input that into the ConvNet, and have a ConvNet make a prediction. And presumably for that little region in the red square, it’ll say, no that little red square does not contain a car. In the Sliding Windows Detection Algorithm, what you do is you then pass as input a second image now bounded by this red square shifted a little bit over and feed that to the ConvNet. So, you’re feeding just the region of the image in the red squares of the ConvNet and run the ConvNet again. And then you do that with a third image and so on. And you keep going until you’ve slid the window across every position in the image. But the idea is you basically go through every region of this size, and pass lots of little cropped images into the ConvNet and have it classified zero or one for each position as some stride. Now, having done this once with running this was called the sliding window through the image. You then repeat it, but now use a larger window. So, now you take a slightly larger region and run that region. So, resize this region into whatever input size the ConvNet is expecting, and feed that to the ConvNet and have it output zero or one. And then slide the window over again using some stride and so on. And you run that throughout your entire image until you get to the end. And then you might do the third time using even larger windows and so on.
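To make the quoted procedure concrete, here is a minimal sketch of the loop it describes. It is not code from the course: the function names, the `(x, y, width, height)` box format, the window sizes, and the toy classifier that stands in for "crop, resize, run the ConvNet" are all assumptions made for illustration.

```python
def sliding_windows(image_w, image_h, window_sizes, stride):
    """Yield (x, y, w, h) for every window position, in the order sizes are given."""
    for w, h in window_sizes:
        for y in range(0, image_h - h + 1, stride):
            for x in range(0, image_w - w + 1, stride):
                yield (x, y, w, h)


def detect(image_w, image_h, window_sizes, stride, classify):
    """Return every window for which the classifier outputs 1.

    `classify` is a hypothetical stand-in for cropping the region, resizing it
    to the ConvNet input size, and running the ConvNet on it.
    """
    return [box
            for box in sliding_windows(image_w, image_h, window_sizes, stride)
            if classify(box) == 1]


# Made-up object location, used only by the toy classifier below.
OBJECT = (40, 40, 30, 30)  # x, y, width, height


def toy_classifier(box):
    """Toy rule (an assumption): output 1 when the window fully covers OBJECT."""
    x, y, w, h = box
    ox, oy, ow, oh = OBJECT
    covers = x <= ox and y <= oy and x + w >= ox + ow and y + h >= oy + oh
    return 1 if covers else 0


hits = detect(100, 100, [(32, 32), (64, 64)], stride=16, classify=toy_classifier)
print(hits)
```

With these made-up numbers, no 32x32 window fires, while four overlapping 64x64 windows do. Each returned window is itself the crude box estimate, and the same object is reported more than once.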

So, the sliding windows algorithm tries windows as large as the entire image, so there is pretty much no question of it “dealing with objects larger than its window”. Now, the major disadvantage, even with the convolutional implementation of Sliding Windows, is that it performs object detection but doesn’t predict the bounding box in terms of its explicit position. The inference procedure can tell us in which grid cell the object lies and with what window size the algorithm detected it. We can scale these back to the original image and take that window as a crude bounding box prediction, that’s it.

On the other hand, YOLO predicts bounding boxes in terms of their explicit positions, for which we need a dataset with a different kind of labelling, one that also includes the bounding box coordinates as labels. I hope this makes sense.

Cheers,
Elemento

Thanks for that. I’m still a little fuzzy on how non-max suppression, which depends on IOU, which depends on bounding box location and shape, can help disambiguate potential duplicates in the sliding windows world, but I’ll noodle on it some more.

Hey @wallik2 and @ai_curious,
I just wanted to mention one more thing that I missed earlier, which Nomen’s query has helped me address. In the case of Sliding Windows, there is no discussion of multiple bounding box predictions for the same object. As I stated in my previous reply, the bounding box predictions produced by Sliding Windows are crude, so as long as we get a single bounding box prediction, we are good to go, and if we get multiple, we can simply consider them all.

The concept of non-max suppression for eliminating overlapping bounding boxes, as Nomen pointed out, depends on bounding box coordinates, and hence is used with YOLO. I hope this helps clear things up.
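For readers who have not reached that lecture yet, here is a rough sketch of IoU-based non-max suppression. It is a simplified greedy version, not the course's implementation: the `(x1, y1, x2, y2)` box format, the function names, the example boxes and scores, and the 0.5 threshold are assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it, repeat.

    Returns the indices of the boxes that survive.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Keep only the remaining boxes that overlap `best` less than the threshold.
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep


# Made-up detections: two near-duplicates and one separate box.
boxes = [(10, 10, 60, 60), (12, 12, 62, 62), (100, 100, 150, 150)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
print(kept)
```

In this example the second box overlaps the first with an IoU of about 0.85, so it is suppressed, and the third box is kept because it does not overlap at all.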

Cheers,
Elemento

I also think the yellow and green bounding box ‘estimates’ drawn in the picture by the OP could not have been derived from sliding only the corresponding yellow and green windows shown in the image. But I haven’t ever implemented a sliding windows solution myself, so maybe I am missing something about sliding window size and object size.

Hey @ai_curious,
I believe that Saran drew these bounding boxes himself for this image, but they could be produced by the Sliding Windows algorithm as well. This is because, in the Sliding Windows algorithm, we can use windows of different sizes and aspect ratios (not necessarily square), since whatever the window size is, we always scale the input to fit the input size expected by the trained ConvNet.

Cheers,
Elemento

Which leads back to my first post of this thread. My understanding of sliding windows is that no single forward propagation of sliding windows can predict a bounding box larger than its own window. How could it, since there is no information about what is outside its own dimensions? And if that is true, then the green window drawn in the picture above could not have produced the green box, and the yellow window could not have produced the yellow box. And non-max suppression might not be able to help resolve duplicate predictions if IOU is too low; it would only work for two window sizes closer to each other than the IOU threshold. Anyway, the point is rather moot since the OP accepted the first response as a solution so it’s only me unclear on how this works. I’ll get it eventually. Cheers.
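To put a number on the point about window sizes and the IoU threshold, here is a small worked example. It assumes the two windows are concentric squares, which is a simplification and not taken from the picture in this thread. The function names and the 0.5 threshold are also assumptions.

```python
import math


def concentric_square_iou(small_side, large_side):
    """IoU of two concentric squares.

    The intersection is the smaller square and the union is the larger one,
    so the IoU is the ratio of their areas.
    """
    return (small_side ** 2) / (large_side ** 2)


def min_side_ratio(iou_threshold):
    """Smallest small/large side ratio whose IoU still reaches the threshold."""
    return math.sqrt(iou_threshold)


print(concentric_square_iou(64, 128))  # 0.25
print(min_side_ratio(0.5))             # about 0.707
```

Under this assumption, a window half the side length of another overlaps it with an IoU of only 0.25, so a 0.5 threshold would treat them as different detections. The two windows would be merged only if the smaller side were at least about 71% of the larger one.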