Week 3 YOLO Doubt About Sliding Windows

I watched the “Bounding Box Prediction” video and was left with some doubts.

Before this, we were taught about Sliding Window Convolutions and how to execute this operation efficiently. My understanding of Sliding Window Convolutions is:

  1. We train a CNN model on an nxn input (ignoring the channels).
  2. Then we use it to **localize an object** (only a single object can be localized; if there are multiple objects it won’t work) by feeding it an mxm input (m >= n, implemented with the efficient sliding window convolutions explained to us). What we now have is an rxrxk output, where k is the number of classes the CNN model was trained on and rxr is the total number of positions an nxn window can take on the mxm image (see the sketch after this list).
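To make the shapes concrete, here is a small sketch of that arithmetic (my own illustration, not from the course; the numbers are made up):

```python
# Hypothetical example: how many positions an n x n window can take on an
# m x m image with a given stride (r x r of them), giving an r x r x k output.
def sliding_window_grid(m, n, stride=1):
    return (m - n) // stride + 1      # valid window positions along one dimension

m, n = 14, 10                         # 14x14 image, 10x10 trained window (made up)
r = sliding_window_grid(m, n, stride=2)
print(r, "x", r, "windows")           # 3 x 3 -> output volume is 3 x 3 x k
```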

With this understanding, I then went on to the YOLO algorithm and found that:

  1. Instead of the single object localization we do for an input image with our “traditional” sliding window convolutions, in YOLO we do multiple object localization: bounding boxes around each object present in the image.
  2. We are using the same sliding window convolution that I described above, which means:
    2.1 We are **training the CNN model on a 19x19** (ignoring channels) input image size for single object detection.
    2.2 Then for object detection (i.e. detecting and localizing multiple objects in a single image) we feed a 100x100 image into the CNN model to obtain bounding boxes for each **19x19** window (since our CNN was trained on 19x19 images).
  3. This 19x19 is a hyperparameter that is for us to choose wisely.

So, as per my (obviously) flawed understanding of YOLO, YOLO is the same as a sliding window convolution network.
Please help me see where I am wrong (I am wrong somewhere, and I am damn sure about it).
Please provide additional information too; that would be helpful.

Hi ashish_learns,

The best explanation I can think of is provided in the original YOLO paper itself, which you can find here. As the authors write: “We reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities”.


Here is how I differentiate the two approaches:

Sliding windows subdivides the input image. YOLO does not.

Sliding windows runs forward propagation once for each subdivision. YOLO runs forward propagation once.

Sliding windows can only locate an object within the one subdivision input to forward propagation at a time. Therefore, objects that are larger than the input image subdivision or that overlap subdivision boundaries cause issues. Since YOLO does not subdivide the input image, it doesn’t matter how big or exactly where the objects are within the image. (Note, the image input to the original YOLO was 448x448 pixels. YOLO 9000 used 416x416 pixels. The 7x7 or 19x19 we talk about is the number of grid cells, not the number of pixels input to the CNN. Each grid cell covers hundreds to thousands of pixels)

Sliding windows detects one object per forward propagation. YOLO detects (grid cell count * grid cell count * anchor box count) objects per forward propagation (e.g. 19 * 19 * 5 = 1,805).
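Spelled out quickly in code (my own sketch of the arithmetic, using the numbers mentioned above):

```python
# Detections per forward pass and pixel coverage per grid cell (rough arithmetic)
grid_cells   = 19                                # 19 x 19 grid from the course example
anchor_boxes = 5                                 # anchors per grid cell
print(grid_cells * grid_cells * anchor_boxes)    # 1805 candidate detections per pass

image_side = 448                                 # YOLO v1 input resolution in pixels
print(image_side / 7)                            # 64.0 -> each of the 7x7 cells spans 64x64 pixels
```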

That YOLO could detect 1,800 objects significantly faster due to the single forward propagation, handle randomly positioned objects, and still deliver competitive accuracy is why it was so important.

Hi! I believe you were comparing the simple sliding windows algorithm with YOLO, not the convolutional implementation of sliding windows, is that correct? The convolutional approach processes the entire image in a single forward pass, similar to YOLO. Could you clarify how this differs from YOLO, and what advantages YOLO might offer over this method?

My reply above differentiates ‘simple’ sliding windows from YOLO. I feel that Convolutional Implementation of Sliding Windows (CISW) is a bit of an oxymoron, more confusing than helpful. As shown in the video related to the OverFeat paper, there actually isn’t any sliding window in CISW; there is just convolution. Convolution with stride, if you want. CISW doesn’t subdivide the image and run forward prop multiple times, which makes it indistinguishable from plain convolution. So yes, YOLO looks pretty much like what is called CISW.
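A toy sketch of that point (my own code, not from the course): one valid convolution over the whole image produces the response for every possible window position in a single pass, with nothing explicitly sliding except the convolution itself.

```python
import numpy as np

# Toy illustration: a valid convolution over the full image yields the
# response for every possible window position in one forward pass.
def conv2d_valid(image, kernel):
    m, n = image.shape[0], kernel.shape[0]
    r = m - n + 1                               # stride 1
    out = np.zeros((r, r))
    for i in range(r):                          # loops kept for readability;
        for j in range(r):                      # frameworks vectorize this step
            out[i, j] = np.sum(image[i:i + n, j:j + n] * kernel)
    return out

image, kernel = np.random.rand(6, 6), np.random.rand(3, 3)
print(conv2d_valid(image, kernel).shape)        # (4, 4): 16 window responses at once
```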

Note, however, that YOLO v1 did have fully connected layers following multiple convolutional layers, rather more like the first row of that slide than the lower ones.


The YOLO v1 architecture

Conv here means convolutional and Conn means connected. 448x448 image input, 7x7 grid, 7x7x30 output.
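For reference, the 30 in that output tensor follows from the YOLO v1 encoding described in the paper: each of the S x S cells predicts B boxes (5 numbers each) plus C class probabilities.

```python
# YOLO v1 output depth: S x S x (B * 5 + C)
S = 7       # grid cells per side
B = 2       # bounding boxes predicted per cell
C = 20      # PASCAL VOC classes
print(S, S, B * 5 + C)    # 7 7 30  (the 5 is x, y, w, h, confidence per box)
```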

EDIT BELOW
In support of my assertion above that Convolutional Implementation of Sliding Windows (CISW) as described in the OverFeat paper is all convolution and no sliding windows, I attach this excerpt from the paper …

ConvNets and Sliding Window Efficiency

In contrast to many sliding-window approaches that compute an entire pipeline for each window of the input one at a time, ConvNets are inherently efficient when applied in a sliding fashion because they naturally share computations common to overlapping regions. When applying our network to larger images at test time, we simply apply each convolution over the extent of the full image. …

As far as I can determine, “sliding fashion” here describes convolutional strides, not a decomposition of the input into discrete regions. I welcome clarification if I missed the mark.

Since both the Convolutional Implementation of Sliding Windows (CISW) and YOLO process the entire image in a single forward pass, could you explain why YOLO is considered faster than CISW?

I did a quick review of the videos from this week and didn’t find a mention of YOLO outperforming CISW. If you have one, please provide it. In the absence of that, we can fall back on the comparison provided by the YOLO authors in the v1 paper:

OverFeat. Sermanet et al. train a convolutional neural network to perform localization and adapt that localizer to perform detection [32]. OverFeat efficiently performs sliding window detection but it is still a disjoint system. OverFeat optimizes for localization, not detection performance. Like DPM, the localizer only sees local information when making a prediction. OverFeat cannot reason about global context and thus requires significant post-processing to produce coherent detections.

[32] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. CoRR, abs/1312.6229, 2013.

I haven’t seen a direct comparison of the two algorithms in terms of accuracy and throughput. The YOLO papers explicitly compare against several other contemporary approaches, but not OverFeat/CISW.

I did a little more digging on this topic for this thread…