Week 3 YOLO Doubt About Sliding Windows

I watched the “Bounding Box Prediction” video and was left with some doubts.

Before this, we were taught about sliding window convolutions and how to execute this operation efficiently. My understanding of sliding window convolutions is:

  1. We train a CNN model for an nxn input (ignoring the channels).
  2. Then we use this model to **localize an object** (only a single object can be localized; it won’t work for multiple objects) by feeding it an mxm input (m >= n, implementing the efficient sliding window convolutions explained to us). What we get is an rxrxk output, where k is the number of classes the CNN model was trained on and rxr is the total number of positions the nxn window can take on the mxm image.

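My mental model of the shapes above can be sketched as follows (the sizes are hypothetical, and stride 1 with no padding are my assumptions):

```python
# Sketch of the output shape of an efficient sliding-window convolution.
# Hypothetical sizes; assumes stride 1 and no padding.
n = 14          # training input size (n x n)
m = 28          # test image size (m x m), with m >= n
k = 4           # number of classes the CNN was trained on
r = m - n + 1   # number of window positions along each axis
print((r, r, k))  # output volume: one k-way prediction per window position
# -> (15, 15, 4)
```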
With this understanding, I moved on to the YOLO algorithm and found that:

  1. Instead of doing only single-object localization for an input image as in our “traditional” sliding window convolutions, in YOLO we are doing multiple-object localization - bounding boxes around each object present in the image.
  2. We are using the same sliding window convolution that I described above - which means:
    2.1 We are **training the CNN model on a 19x19** (ignoring channels) input image size for single-object detection.
    2.2 Then for object detection (i.e. detecting and localizing multiple objects in a single image) we are feeding a 100x100 sized image into the CNN model to obtain bounding boxes for each **19x19** window (since our CNN was trained on 19x19 sized images).
  3. This 19x19 is a hyperparameter that is for us to choose wisely.

So, as per my (obviously) flawed understanding, YOLO is the same as a sliding window convolutional network.
Please help me correct where I am wrong (I am wrong somewhere, and I am damn sure about it).
Please provide additional information too; that would be helpful.

Hi ashish_learns,

The best explanation I can think of is provided in the original YOLO paper itself, which you can find here. As the authors write: “We reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities”.


Here is how I differentiate the two approaches:

Sliding windows subdivides the input image. YOLO does not.

Sliding windows runs forward propagation once for each subdivision. YOLO runs forward propagation once.

Sliding windows can only locate an object within the single subdivision fed to forward propagation at a time. Therefore, objects that are larger than the subdivision or that overlap subdivision boundaries cause issues. Since YOLO does not subdivide the input image, it doesn’t matter how big the objects are or exactly where they sit within the image. (Note: the image input to the original YOLO was 448x448 pixels; YOLO9000 used 416x416 pixels. The 7x7 or 19x19 we talk about is the number of grid cells, not the number of pixels input to the CNN. Each grid cell covers hundreds to thousands of pixels.)

Sliding windows detects one object per forward propagation. YOLO detects (grid cell count * grid cell count * anchor box count) objects per forward propagation (e.g. 19 * 19 * 5).
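The arithmetic above can be sketched as follows (the grid and anchor counts come from the example above; the 20-class count is an illustrative assumption, as in PASCAL VOC):

```python
# Candidate detections per single forward pass for a grid-based detector.
# 19x19 grid and 5 anchor boxes as in the text; 20 classes is an assumption.
grid, anchors, classes = 19, 5, 20
detections = grid * grid * anchors      # candidate boxes per forward pass
per_box = 5 + classes                   # (x, y, w, h, confidence) + class scores
print(detections)                       # -> 1805
print((grid, grid, anchors * per_box))  # network output tensor shape -> (19, 19, 125)
```

So one forward pass emits all 1,805 candidate boxes at once, whereas a sliding-window approach would need one forward pass per candidate.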

That YOLO could detect roughly 1,800 objects (19 * 19 * 5 = 1,805) significantly faster thanks to the single forward propagation, handle arbitrarily positioned objects, and still deliver competitive accuracy is why it was so important.