YOLO vs Convolutional Sliding Window

I have read the OverFeat paper referenced by Prof Ng in the discussion of Convolutional Implementation of Sliding Windows (CISW), but I have never worked with an implementation of it. I have, however, spent a lot of time working to understand YOLO and implementing it myself from scratch. Here is my compare/contrast.

First, the basic network architecture of multiple convolutional layers successively downsampling the single input image is comparable. YOLO v1 had two fully connected layers after the convolutional layers, but I think they were removed in favor of a fully convolutional implementation with v2.

The biggest difference I see is in the number of predictions made, and what drives the number of them. In CISW, it appears that there can be a very large number of predictions made per object, and that this number is driven by the filter size (and stride?) of the convolutions.

This gaggle of bounding boxes is then “resolved” based on confidence down to a single box.
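To make the "how many predictions" point concrete, here is a minimal sketch of how the number of window positions grows with image size for a fixed window and stride. The function names are my own, and the sizes (a 14-pixel window, effective stride 2, inputs of 16 and 28 pixels) are illustrative values rather than numbers taken from the OverFeat paper. It also assumes no padding and a square image.

```python
def positions_per_axis(image_size: int, window_size: int, stride: int) -> int:
    """Number of window positions along one axis, assuming no padding."""
    if image_size < window_size:
        return 0
    return (image_size - window_size) // stride + 1


def total_window_predictions(image_size: int, window_size: int, stride: int) -> int:
    """Total predictions for a square image: one per window position."""
    n = positions_per_axis(image_size, window_size, stride)
    return n * n


if __name__ == "__main__":
    # Illustrative sizes only (hypothetical, not from the paper).
    for size in (16, 28, 224):
        print(size, total_window_predictions(size, window_size=14, stride=2))
```

The count scales roughly with (image size / stride) squared, so a small effective stride on a large image produces many overlapping windows over the same object.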

In YOLO, the number of predictions is driven by the grid size (S) and the number of boxes predicted per grid cell (B, which is the anchor box count from v2 onward). For each grid-cell-sized image region there can be at most B predictions, so the total is at most S × S × B, which I think is substantially lower than with CISW. Possible duplicates are then pruned using non-max suppression. My understanding is that this results in localization and classification accuracy at least as good, with much higher throughput. I think this explains why YOLO held the world's attention for years while CISW was merely a step that was rather quickly subsumed and surpassed.
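Here is a minimal sketch of the two ideas in the paragraph above: the S × S × B prediction count and greedy non-max suppression. It is plain Python written for illustration, not the code from any YOLO release. The function names, the (x1, y1, x2, y2) corner box format, and the 0.5 IoU threshold are my own assumptions. S = 7 and B = 2 are the values used in the YOLO v1 paper.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def yolo_prediction_count(S: int, B: int) -> int:
    """Upper bound on box predictions: B per cell on an S x S grid."""
    return S * S * B


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two corner-format boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(
    boxes: Sequence[Box], scores: Sequence[float], iou_threshold: float = 0.5
) -> List[int]:
    """Greedy NMS: visit boxes by descending score and keep a box only if
    it does not overlap an already-kept box by more than the threshold.
    Returns the indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    print(yolo_prediction_count(S=7, B=2))
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
    scores = [0.9, 0.8, 0.7]
    print(non_max_suppression(boxes, scores))
```

In a real detector NMS would usually be run per class and after a confidence threshold, but the pruning logic is the same.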

Welcome your thoughts and corrections.

OverFeat paper on arXiv
https://arxiv.org/pdf/1312.6229

YOLO v1 paper on arXiv
https://arxiv.org/pdf/1506.02640

NOTE: the YOLO authors briefly mention and contrast with OverFeat, and reference it in their bibliography.

NOTE: the architectures of the ConvNets in each of these papers owe a debt to the seminal paper ImageNet Classification with Deep Convolutional Neural Networks by Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton.

Krizhevsky et al. paper at ACM
https://dl.acm.org/doi/pdf/10.1145/3065386
