Questions about sliding window and YOLO

Hi all, I have some understanding of, and some confusion about, sliding windows and YOLO, which I would like to share. Please correct me if I got something wrong, and help me with the parts I find confusing. Thanks in advance!

To detect an object and output its bounding box, the sequential version of sliding window is the easiest to understand: scan through the image region by region using windows of ALL shapes and sizes. For each window position, run a classifier; the output is simply 1 * 1 (ignoring the classes etc.), a yes-or-no answer to whether there is an object in that window. But this is computationally expensive: the windows overlap, and wherever a CNN filter position lies inside an overlap region, the same filter multiplication is repeated once for every window that covers it. A lot of waste!
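
To put a number on the overlap, here is a minimal sketch. It assumes a single 14 * 14 window slid over a 16 * 16 image with stride 2 (the sizes used later in this post); the function names are made up for illustration and there is no real classifier involved.

```python
import numpy as np

def sliding_windows(height, width, win, stride):
    """Yield the (top, left) corner of every win x win window that fits."""
    for top in range(0, height - win + 1, stride):
        for left in range(0, width - win + 1, stride):
            yield top, left

def coverage_counts(height, width, win, stride):
    """Count, for each pixel, how many windows contain it."""
    counts = np.zeros((height, width), dtype=int)
    for top, left in sliding_windows(height, width, win, stride):
        counts[top:top + win, left:left + win] += 1
    return counts

# Assumed sizes: 16 x 16 image, 14 x 14 window, stride 2.
corners = list(sliding_windows(16, 16, win=14, stride=2))
counts = coverage_counts(16, 16, win=14, stride=2)

print(corners)                    # [(0, 0), (0, 2), (2, 0), (2, 2)]
print(counts.max())               # 4: the central pixels sit in every window
print(int((counts == 4).sum()))   # 144 of the 256 pixels are processed four times
```

Running the classifier separately on each of the four crops would redo the work on those 144 shared pixels four times over.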

Question 1: so are we using the windows themselves as the bounding boxes (when we detect an object)?
If so, no wonder this algorithm is inaccurate: it never computes (bx, by, bw, bh).

Question 2: how do we choose the window shapes and sizes? Besides trying them all, consider this case: if a smaller window captures an object, a larger one around the same position will capture it again. How do we choose between them for the bounding box?

Now we come to the convolutional version of sliding window. We can save the computation results for each filter position and, at the end, trace back which window each output came from! To do that, first pick a window size and shape (the lecture uses 14 * 14) and train a CNN whose input has the same shape as the window. Once training is done, feed the original test image, which is larger (16 * 16 in the lecture), into the trained CNN. Since the input is now larger than what the network was trained with, the output naturally becomes larger as well: it is no longer 1 * 1 but 2 * 2 (four values, as in the lecture), and each value corresponds to one window position, again telling us yes or no for an object at that position.
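
The output sizes can be checked with a little arithmetic. The sketch below assumes the layer stack as I recall it from the lecture (5 * 5 CONV, 2 * 2 max pool with stride 2, then a 5 * 5 CONV standing in for the first FC layer, followed by 1 * 1 CONVs that do not change the spatial size); treat those layer sizes as an assumption.

```python
def conv_out(size, kernel, stride=1):
    """Spatial output size of a 'valid' convolution or pooling layer."""
    return (size - kernel) // stride + 1

def network_out(size):
    """Output size of the assumed lecture network for a size x size input."""
    size = conv_out(size, kernel=5)            # 5x5 CONV
    size = conv_out(size, kernel=2, stride=2)  # 2x2 max pool, stride 2
    size = conv_out(size, kernel=5)            # 5x5 CONV replacing the FC layer
    return size                                # later 1x1 CONVs keep the size

def window_positions(size, win=14, stride=2):
    """Window positions per axis for an explicit sliding window."""
    return (size - win) // stride + 1

for size in (14, 16, 28):
    print(size, network_out(size), window_positions(size))
# 14 1 1
# 16 2 2
# 28 8 8
```

The network output matches the number of stride-2 window positions along each axis, so a 16 * 16 input gives four window answers in a single forward pass. The effective window stride of 2 comes from the max pool layer.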

Question 3: why does Andrew point out that an FC layer is equivalent to a CONV layer whose filter has the same size as the previous layer’s output? Yes, that is true. But it does not seem to simplify anything. Sequentially or convolutionally, both can use FC or CONV.
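
One way to see what the equivalence buys: an FC unit only accepts the input size it was trained on, while the same weights viewed as a CONV filter can slide over a larger input. The sketch below is a toy with a single unit, random weights, and plain NumPy loops (all names are hypothetical). It uses stride 1, so a 16 * 16 image gives 3 * 3 positions here, whereas the stride of 2 in the lecture comes from its pooling layer.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal((14, 14))  # one FC unit over a flattened 14x14 input
bias = 0.5

def fc_unit(window):
    """The unit applied as a fully connected layer: flatten, dot, add bias."""
    return float(window.reshape(-1) @ weights.reshape(-1) + bias)

def conv_valid(image, kernel, bias):
    """The same weights applied as a 'valid' convolution (cross-correlation)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel) + bias
    return out

image = rng.standard_normal((16, 16))
conv_map = conv_valid(image, weights, bias)

# Running the FC unit separately on every 14x14 crop gives the same numbers.
fc_map = np.array([[fc_unit(image[i:i + 14, j:j + 14]) for j in range(3)]
                   for i in range(3)])

print(conv_map.shape)                 # (3, 3)
print(np.allclose(conv_map, fc_map))  # True
```

On a 14 * 14 input the two forms give the same single number; the difference only shows up when the input is larger than the training size.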

Question 4: the convolutional implementation of sliding window still does not tell us how to choose the window size or aspect ratio. It looks like we still need to try them ALL! What is even worse, do we need to train a new CNN for EACH window size/ratio choice? That could be a lot of work!

The sliding window algorithm does not compute accurate bounding boxes. That’s where YOLO comes in. YOLO gets rid of the windows and actually computes (bx, by, bw, bh) for each box. Another good thing is that we no longer need to try all window shapes!

But the “window” idea does one extra good thing: a window makes sure there is at most one object within it. When there are multiple objects in the image, YOLO does not know in advance how many boxes to compute. That’s why YOLO needs to “divide” the image into smaller grid cells, so that there is at most one object in a cell.

In YOLO, “imagine” we first divide the original image into smaller grid cells and apply the CONV sliding window algorithm to it. But we don’t actually do the “dividing”; instead, we just need to design a specific CNN architecture. For example, the input is 608 * 608 (ignoring RGB), and we want to divide it into 19 * 19 non-overlapping cells. 608/19 = 32, which means we need to design a CNN that shrinks the input by a factor of 32, with the “window stride” equal to 32 pixels.
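
The arithmetic can be sketched as follows. Reaching the factor of 32 with five stride-2 stages (32 = 2^5) is just one possible design and an assumption here, as are the helper names.

```python
def grid_geometry(input_size, grid_size):
    """Return (cell_size, halvings) for a square input split into grid_size cells per side."""
    if input_size % grid_size != 0:
        raise ValueError("input size must be a multiple of the grid size")
    cell_size = input_size // grid_size
    halvings, remaining = 0, cell_size
    while remaining % 2 == 0:      # count how many stride-2 stages are needed
        remaining //= 2
        halvings += 1
    if remaining != 1:
        raise ValueError("cell size is not a power of two")
    return cell_size, halvings

def cell_bounds(row, col, cell_size):
    """Pixel region (top, left, bottom, right) that a grid cell corresponds to."""
    top, left = row * cell_size, col * cell_size
    return top, left, top + cell_size, left + cell_size

cell_size, halvings = grid_geometry(608, 19)
print(cell_size, halvings)              # 32 5
print(cell_bounds(0, 1, cell_size))     # (0, 32, 32, 64)
print(cell_bounds(18, 18, cell_size))   # (576, 576, 608, 608)
```

Neighbouring cells start 32 pixels apart and do not overlap, so the grid tiles the 608 * 608 image exactly.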

Question 5: this design doesn’t sound easy. Taking the sliding window example, 16 * 16 → 2 * 2 and the window stride is 2, not 8! Is that true?

Question 6: I want to make sure I understand the usefulness of this “dividing” scheme. In other words, what’s wrong with treating the whole image as one big cell? My understanding is that “dividing” helps us avoid the case of multiple objects in one cell, so each cell only needs to compute one (bx, by, bw, bh). To make this well defined, YOLO assigns each object to the cell that contains its center.

Question 7: now finally, how does YOLO learn to predict the bounding box (bx, by, bw, bh)? Labeled (bx, by, bw, bh) and a loss function, is that all?

Yes. Labelled inputs and a loss function, same as any supervised learning problem. However, in addition to location (where is it), YOLO simultaneously predicts object class (what is it) and presence (is an object there at all), so the total loss has three components. Each component uses a weighted sum-squared error, where the weights are multiplicative factors that account for the imbalance in the background-to-object ratio (images are mostly sparse, so ‘no object present’ predictions would otherwise overwhelm training).

I’ll try to address the sliding windows questions in a later response if no one else has.

Here’s a picture of the YOLO v2 loss function:

[Image: YOLO9000 full loss equation]

x_i - \hat{x}_i and y_i - \hat{y}_i are the center location error
w_i - \hat{w}_i and h_i - \hat{h}_i are the shape error

C_i - \hat{C}_i is the object presence (confidence) error

p_i(c) - \hat{p}_i(c) is the class prediction error

\lambda_{coord} and \lambda_{noobj} are multiplicative weighting factors

S is the number of grid cells along each side, so the grid has S * S cells (eg 19 in our exercise, 7 in the original paper), and B is the number of anchor boxes (eg 5 in our exercise, 2 in the original paper)
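
Since the picture does not come through as text, here for reference is the sum-squared loss as written in the original (v1) YOLO paper, which uses the same symbols. That the picture shows exactly this form is an assumption on my part. The indicator \mathbf{1}_{ij}^{obj} is 1 when box j of cell i is responsible for an object, \mathbf{1}_{ij}^{noobj} is 1 when it is not, and \mathbf{1}_{i}^{obj} is 1 when an object's center falls in cell i.

```latex
\begin{align*}
\mathcal{L} ={}& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}_{ij}^{obj}
      \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
  &+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}_{ij}^{obj}
      \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
           + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
  &+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}_{ij}^{obj} \left( C_i - \hat{C}_i \right)^2
   + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}_{ij}^{noobj}
      \left( C_i - \hat{C}_i \right)^2 \\
  &+ \sum_{i=0}^{S^2} \mathbf{1}_{i}^{obj} \sum_{c \in classes}
      \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{align*}
```

In the v1 paper the shape term compares square roots of width and height rather than the raw values, and the weights are set to \lambda_{coord} = 5 and \lambda_{noobj} = 0.5.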

Regarding Question 6: this is somewhat correct. Grid cells are introduced to handle (not avoid) cases where multiple objects are in one image (not multiple objects in one cell). Anchor boxes are introduced to handle multiple objects in one grid cell.
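
To illustrate how anchor boxes can separate two objects centered in the same cell, here is a sketch of one common convention: each ground truth box goes to the anchor whose shape overlaps it best. The anchor shapes, the box shapes, and the function names are all made up for illustration and are not taken from the exercise.

```python
def shape_iou(wh_a, wh_b):
    """IoU of two boxes compared by shape only, as if they shared the same center."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union

def best_anchor(box_wh, anchors):
    """Index of the anchor whose shape best matches the box."""
    return max(range(len(anchors)), key=lambda k: shape_iou(box_wh, anchors[k]))

# Hypothetical anchors as (width, height): one tall, one wide.
anchors = [(1.0, 3.0), (3.0, 1.0)]

pedestrian_wh = (0.9, 2.5)   # tall object centered in the cell
car_wh = (2.8, 1.2)          # wide object centered in the same cell

print(best_anchor(pedestrian_wh, anchors))  # 0
print(best_anchor(car_wh, anchors))         # 1
```

The two objects land in different anchor slots of the same cell, so both can be labelled and predicted.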

The statement that an object belongs to the cell containing its center is true at training time. Given a ground truth bounding box for an object, the coordinates of the center of the object can be derived. From that, the YOLO grid cell in which the center occurs can be determined. During training, that grid cell index is used as part of the calculation for learning to predict the b_x and b_y coordinates. Based on that training, during runtime each grid cell + anchor box makes its own prediction about whether there is an object centered at its location or not.

So during training, you could say that YOLO defines which grid cell the object center belongs to. During runtime, each grid cell + anchor box predicts whether or not an object is centered within it.
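
The training-time assignment can be sketched like this. It assumes a 608 * 608 image with a 19 * 19 grid and expresses b_x and b_y as offsets within the cell, in units of one cell; the function name and the example center are hypothetical.

```python
def assign_cell(cx, cy, image_size=608, grid_size=19):
    """Map an object center in pixels to (row, col, bx, by).

    bx and by are the offsets of the center inside its cell, in [0, 1).
    """
    cell = image_size / grid_size                # 32 pixels per cell here
    col = min(int(cx // cell), grid_size - 1)    # clamp centers on the far edge
    row = min(int(cy // cell), grid_size - 1)
    bx = cx / cell - col
    by = cy / cell - row
    return row, col, bx, by

# Hypothetical ground truth box centered at pixel (x=304, y=100).
print(assign_cell(304, 100))   # (3, 9, 0.5, 0.125)
```

Only cell (3, 9) gets a positive label for this object during training; every other cell is labelled as having no object centered in it.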

I think it is hard to be specific about what ‘the sliding windows’ algorithm does regarding object scale and location because there are several pre-YOLO algorithms that share some common general characteristics but different specifics. A shared general characteristic is they used separate pipelines for the location and classification tasks and they ran those pipelines separately for image sub-regions. The principal drawback of all of these wasn’t that they were inaccurate, but that they were slow. YOLO merged the separate regression and classification pipelines into one. YOLO was competitive with its peers on accuracy but substantially faster. These paragraphs from the v1 paper summarize the approach well:

*Current detection systems repurpose classifiers to perform detection. To detect an object, these systems take a classifier for that object and evaluate it at various locations and scales in a test image. Systems like deformable parts models (DPM) use a sliding window approach where the classifier is run at evenly spaced locations over the entire image.

More recent approaches like R-CNN use region proposal methods to first generate potential bounding boxes in an image and then run a classifier on these proposed boxes. After classification, post-processing is used to refine the bounding boxes, eliminate duplicate detections, and rescore the boxes based on other objects in the scene. These complex pipelines are slow and hard to optimize because each individual component must be trained separately.

We reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities. Using our system, you only look once (YOLO) at an image to predict what objects are present and where they are.*

Thanks for all of your replies! They helped. I know this is a long post, so extra thanks!
