YOLO vs Convolutional Sliding Window

I can’t tell the difference between YOLO and the convolutional implementation of sliding windows, except that YOLO specifies a fixed number of grid cells to divide the image into (i.e., the number of output positions, regardless of the channels used for landmarks, number of classes, …).

Is YOLO another name for Convolutional Sliding Windows?

Here, have a look at these threads, which describe the same question:


That is not helpful actually!

My question is about the convolutional sliding window, not the plain sliding window, and it is not the same question; I had a glance at those threads.

But thanks anyway…


Going through the convnet course now. I also can’t understand the difference between them, given that in the convolutional implementation of sliding windows we pass the entire image in a single forward pass. Got any clarification about it yet?

I have read the OverFeat paper referenced by Prof Ng in the discussion of the Convolutional Implementation of Sliding Windows (CISW), but I haven’t ever worked with an implementation of it. I have spent a lot of time working to understand YOLO and implementing it myself from scratch. Here is my compare/contrast.

First, the basic network architecture of multiple convolutional layers successively downsampling the single input image is comparable. YOLO v1 had two fully connected layers after the convolutional layers, but I think they were removed in favor of a fully convolutional implementation with v2.

The biggest difference I see is in the number of predictions made, and what drives the number of them. In CISW, it appears that there can be a very large number of predictions made per object, and that this number is driven by the filter size (and stride?) of the convolutions.

This gaggle of bounding boxes is then “resolved” based on confidence down to a single box.

In YOLO, the number of predictions is driven by the grid size (S) and the anchor box count (B). For each grid-cell-sized image region there can be at most B predictions, which I think is substantially lower than with CISW. Then, possible duplicates are pruned using non-max suppression. My understanding is that this results in at least as good localization and classification accuracy with much higher throughput, which explains why YOLO caught the world’s attention for years while CISW was merely a step that was rather quickly subsumed and surpassed.
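To make the pruning step concrete, here is a minimal NumPy sketch of the S x S x B prediction count and a basic non-max suppression routine. The values of S, B, the boxes, and the IoU threshold are illustrative choices of mine, not numbers taken from either paper.

```python
import numpy as np

S, B = 7, 2
print("YOLO v1 candidate boxes per image:", S * S * B)   # 98 boxes before pruning

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def non_max_suppression(boxes, scores, iou_thresh=0.5):
    """Keep the highest-scoring box, drop overlapping near-duplicates, repeat."""
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        best, order = order[0], order[1:]
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

boxes  = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(non_max_suppression(boxes, scores))   # keeps boxes 0 and 2; the near-duplicate box 1 is pruned
```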

Welcome your thoughts and corrections.

OverFeat paper on arXiv
https://arxiv.org/pdf/1312.6229

YOLO v1 on arXiv
https://arxiv.org/pdf/1506.02640

NOTE: the YOLO authors briefly mention and contrast with OverFeat and reference it in their bibliography

NOTE: the architectures of the ConvNets in each of these papers owe a debt to the seminal paper ImageNet Classification with Deep Convolutional Neural Networks by Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton

Krizhevsky et al. paper at ACM
https://dl.acm.org/doi/pdf/10.1145/3065386


After going through the YOLO V1 paper and the Convolutional implementation of Sliding Window (CISW) paper, here is my take about YOLO V1:

It does NOT apply a sliding window. Instead, it’s a Conv + FC net that predicts bounding boxes over a 7x7 grid, that’s it. Each grid cell detects objects whose centers fall within that cell, predicts 2 candidate bounding boxes, and outputs a vector [p_1, bx_1, by_1, bw_1, bh_1, p_2, bx_2, by_2, bw_2, bh_2, c1, ..., c20]; across the whole grid this forms a 7x7x30 tensor.
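For what it’s worth, here is a tiny NumPy sketch of how one grid cell’s 30-vector splits up, using the ordering written above (2 boxes, then 20 class probabilities); the values are just random placeholders.

```python
import numpy as np

B, C = 2, 20                             # boxes per cell, classes (Pascal VOC)
cell = np.random.rand(B * 5 + C)         # one cell's slice of the 7x7x30 output

box1 = dict(p=cell[0], bx=cell[1], by=cell[2], bw=cell[3], bh=cell[4])
box2 = dict(p=cell[5], bx=cell[6], by=cell[7], bw=cell[8], bh=cell[9])
class_probs = cell[10:]                  # c1 ... c20, conditional class probabilities
print(len(cell), class_probs.shape)      # 30 (20,)
```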

For those who are curious:

The architecture starts off with conv layers and ends with 2 fully connected (FC) layers; in total, 24 conv layers. The 1x1 convolutions reduce the feature space from the preceding layers, which is very interesting. The first FC layer is connected to the flattened output of the last conv layer, and the second FC layer’s output is reshaped into 7x7x30.
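To make the FC-then-reshape part concrete, here is a minimal tf.keras sketch of just that head (the 24 conv layers are omitted). The 7x7x1024 input shape and the 4096-unit first FC layer follow the paper’s figure, but treat this as an illustrative sketch rather than a faithful reimplementation.

```python
import tensorflow as tf

S, B, C = 7, 2, 20                                     # grid size, boxes per cell, classes

head = tf.keras.Sequential([
    tf.keras.Input(shape=(7, 7, 1024)),                # feature map from the conv backbone
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(4096),                       # first FC layer (holds most of YOLO v1's weights)
    tf.keras.layers.LeakyReLU(0.1),
    tf.keras.layers.Dense(S * S * (B * 5 + C)),        # second FC layer: 7*7*30 = 1470 units
    tf.keras.layers.Reshape((S, S, B * 5 + C)),        # reshaped into 7x7x30
])
print(head.output_shape)                               # (None, 7, 7, 30)
```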

Loss calculation: In training, when calculating the loss, one channel (a vector of 30) is split into two bounding boxes. Both bounding boxes are compared against the ground-truth bounding box(es); the one with the highest Intersection over Union (IoU) is “responsible” for the corresponding ground-truth box. The loss is then calculated by adding the weighted confidence loss and localization loss (the paper gives the full formula).
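Here is a simplified NumPy sketch of the “responsible predictor” selection and the weighting idea, for a single cell with one ground-truth box. Boxes are in corner form here and the expression is deliberately reduced (the real loss also square-roots widths/heights and adds no-object and classification terms), so treat it as a sketch rather than the paper’s exact formula.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

lambda_coord = 5.0                                    # localization weight used in the paper
pred_boxes = [np.array([0.00, 0.00, 0.50, 0.50]),     # the cell's two predicted boxes (made up)
              np.array([0.10, 0.10, 0.60, 0.60])]
pred_conf  = [0.6, 0.8]
gt_box     = np.array([0.10, 0.10, 0.55, 0.55])       # one ground-truth box in this cell

ious = [iou(b, gt_box) for b in pred_boxes]
j = int(np.argmax(ious))                              # the higher-IoU box is "responsible"

localization_loss = np.sum((pred_boxes[j] - gt_box) ** 2)
confidence_loss   = (pred_conf[j] - ious[j]) ** 2     # target confidence = IoU with the truth
print(j, lambda_coord * localization_loss + confidence_loss)
```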

**Further spoiler from the paper:**
During model training, the first 20 conv layers were first pretrained on the ImageNet dataset, with an average-pool layer and an FC layer appended. This process took Redmon et al. approximately a WEEK. Then, following Ren et al., who showed that adding both conv and FC layers can improve performance, they added 4 conv layers and 2 FC layers with randomly initialized weights.


Thanks for that observation. What about the architecture of OverFeat? I thought the point of the “Convolutional Implementation of…” part was that it also does NOT apply a sliding window, at least in the traditional sense of running forward propagation sequentially on input subregions. If correct, the question remains: what distinguishes the two architecturally, and what leads to YOLO v1’s superior throughput? Maybe the way the FC layers are connected and invoked?

The OverFeat architecture from their paper…

The YOLO v1 architecture from their paper…

The Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton ImageNet architecture that informed both of the above…

All three follow the same pattern of convolutional layers followed by fully connected layers; the similarities are far stronger than their differences, at least in the layer architecture itself. To the best of my understanding, none of them are considered “sliding windows”. Thoughts?

This is a great question. My understanding is that in OverFeat, each spatial position in the output tensor can be directly interpreted as the confidence over the output classes. E.g., in the case of a 16x16x3 input, the output is 2x2x4: it gives the confidence for a 14x14 window at all 4 possible locations in the image, in just 1 pass.
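Here is a small tf.keras sketch of that idea, with layer sizes borrowed from the lecture-style 14x14 example rather than from the OverFeat paper itself: the “FC” layers are written as convolutions, so the same weights applied to a 16x16x3 input produce a 2x2x4 map, one 4-class score vector per 14x14 window position, in a single pass.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, None, 3)),                  # any input height/width
    tf.keras.layers.Conv2D(16, 5, activation='relu'),       # 5x5 conv
    tf.keras.layers.MaxPooling2D(2),                        # 2x2 max pool
    tf.keras.layers.Conv2D(400, 5, activation='relu'),      # "FC" layer written as a 5x5 conv
    tf.keras.layers.Conv2D(400, 1, activation='relu'),      # "FC" layer written as a 1x1 conv
    tf.keras.layers.Conv2D(4, 1, activation='softmax'),     # 4 class scores per window position
])

print(model(tf.zeros((1, 14, 14, 3))).shape)   # (1, 1, 1, 4): the original single window
print(model(tf.zeros((1, 16, 16, 3))).shape)   # (1, 2, 2, 4): four window positions, one pass
```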

In YOLO V1, however, each position in the 7x7x30 output corresponds to the [location, confidence] predictions of one grid cell (or subregion), so I believe it’s equivalent to “looking at windows” but without sliding.

what distinguishes the two architecturally, and what leads to YOLO v1’s superior throughput?
These are also great questions. Architecturally:

AlexNet: Conv+max|Conv+max|Conv|Conv|Conv+max|Dense|Dense|Dense|

OverFeat: Conv+max|Conv+max|Conv|Conv|Conv+max|Dense|Dense|Dense|

YOLO V1: Conv+max|Conv+max|bottleneck Conv+max block|bottleneck Conv+max block|bottleneck Conv+max block|Conv block|Dense|Dense|

So OverFeat’s and AlexNet’s architectures look very similar. AlexNet does just image classification, while OverFeat does image classification + object detection.

YOLO V1 (24 conv layers + 2 FC layers) is larger than AlexNet and OverFeat. One forward pass in OverFeat is equivalent to sliding one fixed-size window. YOLO V1, however, can output a window of any size (the network learns the window size from the training data).


Maybe “can output a predicted bounding box of any size”? It’s not clear to me what is meant by a learned output window otherwise, as the YOLO v1 grid size is static and not learned by an algorithm.

For OverFeat, my understanding is similarly that the filter size is fixed, while bounding box coordinates are learned and predicted by localization regression layers that are fed the output of the feature-extractor conv layers initially trained for the classification task.

Great question. The YOLO v1 grid size is static, but the output of one grid cell contains the sizes of two bounding boxes, which are learnable. This is from the YOLO V1 paper:

Each bounding box consists of 5 predictions: x, y, w, h, and confidence. The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell. The width and height are predicted relative to the whole image. Finally the confidence prediction represents the IOU between the predicted box and any ground truth box.
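To make that convention concrete, here is a tiny pure-Python sketch of turning one cell’s (x, y, w, h) into absolute pixel coordinates, assuming S = 7 and the 448x448 input resolution YOLO v1 uses; the cell indices and predicted values below are made up.

```python
S, img_size = 7, 448
cell_size = img_size / S                 # 64 pixels per grid cell

row, col = 3, 5                          # which grid cell made the prediction
x, y, w, h = 0.4, 0.7, 0.25, 0.5         # network outputs, each in [0, 1]

center_x = (col + x) * cell_size         # (x, y) are relative to the cell's bounds
center_y = (row + y) * cell_size
box_w    = w * img_size                  # (w, h) are relative to the whole image
box_h    = h * img_size

print(center_x, center_y, box_w, box_h)  # 345.6 236.8 112.0 224.0
```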

For OverFeat, my understanding is similarly that the filter size is fixed, while bounding box coordinates are learned and predicted by localization regression layers that are fed the output of the feature-extractor conv layers initially trained for the classification task.

For OverFeat, I agree that the filter size is fixed. However, a single output tensor element [class1, class2, class3, class4] is simply the image-recognition confidence across 4 classes for the corresponding window. The whole tensor represents the image-recognition confidence across all locations of the window; instead of explicitly sliding a window, it makes use of convolution to generate a result equivalent to that of the sliding window method.

I think we’re pretty much in agreement except for this statement above. See the Overfeat paper section…

4 Localization
Starting from our classification-trained network, we replace the classifier layers by a regression network and train it to predict object bounding boxes…

However, on re-reading I do see that they are still predicting class confidence as you mention. I don’t see an explicit description of what the predicted bounding box shape is, though, so I inferred that it is equivalent to the convolution filter size. I was misled by the proliferation of predicted bounding boxes shown in Figure 6 and Figure 7 in the paper and inferred that there were many different predicted shapes. Now I think perhaps it is many locations but all the same shape for a given scale, though I have never looked at their code, so I am still not completely sure.

EDIT: see below that this supposition (above) is not correct. Overfeat does predict (different) bounding box shapes, just through a separately trained and invoked localization head that ingests the feature map from the convolutional layers.

Starting from our classification-trained network, we replace the classifier layers by a regression network and train it to predict object bounding boxes…

Ah, I see. I do take back my claim that “there is no regression network”; this is a great observation. I found some slides from Stanford CS231. They say “the fully connected layer of the classifier is replaced by a regressor”. Then the classifier is frozen, and the network is trained again on input labeled with bounding boxes.

The regressor finally outputs [(x, y) of the top-left and bottom-right corners].
The regressor network is as follows:

After a quick scan, I didn’t see how training is done in the first author Sermanet’s C++ implementation, though.

So to figure out what the regressor network really looks like, I found this YouTube video that came up with an explanation that looks mostly reasonable to me, but please take it with a grain of salt.

I’m not sure about the 1x1x4096 implementation though

My understanding of the regressor network is:

  • First train the classification network. Freeze it and add the localizer network in.
  • Layer 1 in the regressor has input 6x7x256 and output 2x3x4096, so I believe it’s a 5x5x256x4096 convolution.
  • Layer 2: input 2x3x4096, output 2x3x1024. This looks like a 1x1 convolution with 1024 filters to me?
  • Output layer: input 2x3x1024, output 2x3x4. So this could also be a 1x1 convolution with 4 filters? (See the shape-check sketch below.)
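As a sanity check on the shapes in the list above, here is a tf.keras sketch of that reading of the regressor head, taking the 6x7x256 feature map at one scale as input. It is one plausible interpretation of the paper, not a faithful reimplementation of OverFeat.

```python
import tensorflow as tf

regressor = tf.keras.Sequential([
    tf.keras.Input(shape=(6, 7, 256)),                     # feature map at one scale
    tf.keras.layers.Conv2D(4096, 5, activation='relu'),    # "FC" layer 1 as a 5x5 conv
    tf.keras.layers.Conv2D(1024, 1, activation='relu'),    # "FC" layer 2 as a 1x1 conv
    tf.keras.layers.Conv2D(4, 1),                          # 4 bounding-box edge outputs
])
print(regressor.output_shape)                              # (None, 2, 3, 4)
```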

I put question marks at the 1x1 convolutions because the 1x1 conv was introduced by the Network In Network (NiN) architecture (Lin et al., 2013) and was heavily used in GoogLeNet in 2014; however, I don’t see either of these in the OverFeat paper’s references.


I am really appreciative of the continued deep dive here. To the best of my knowledge, this thread is the only deep dive on the OverFeat paper and architecture on this platform (or the Coursera forum that preceded it). I found the narrative description in the paper hard to follow, and without the code felt unclear about what the flow actually is. But I think the diagram showing the fully connected layers being run twice, once for class prediction (C + 1) and once for bounding box edges, is 1) consistent with the submitted paper and 2) suggestive of why YOLO is faster at runtime (or test time, as Sermanet et al. call it). Namely, YOLO combines classification and localization predictions into a single complete forward pass; the fully connected layers aren’t run twice as it appears they are here. Thanks for sharing your thoughts.

Thank you for staying with the deep dive, @ai_curious! Your questions definitely helped me (and maybe others in this forum) think further. OverFeat is one of the earliest works in the object detection field and there are better models now, but it’s still helpful to gain a reasonably good understanding of its architecture.