Hi all, I have some understanding of sliding windows and YOLO, along with some confusion, and I would like to share both with you. Please correct me if I got something wrong, and help me with the parts I am confused about. Thanks in advance!

To detect an object and output its bounding box, the sequential version of sliding window is the easiest to understand: scan through the image region by region using ALL kinds of window shapes. For each window position, run a classifier whose output is simply 1 * 1 (ignoring the classes etc.), which tells us whether or not there is an object in that window. But this is computationally expensive: the windows overlap, and whenever a CNN filter position lies within an overlap region, the same filter multiplication is repeated for every window that covers it. A lot of waste!
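To make the cost concrete, here is a minimal sketch of the sequential scan. The function names, the window shapes, and the stride of 2 are my own assumptions for illustration, not something from the lecture. It only enumerates window positions; in the real algorithm each position would cost one full classifier run.

```python
def sliding_windows(image_h, image_w, win_h, win_w, stride):
    """Yield the (top, left) corner of every window that fits inside the image."""
    for top in range(0, image_h - win_h + 1, stride):
        for left in range(0, image_w - win_w + 1, stride):
            yield top, left


def count_windows(image_h, image_w, window_shapes, stride):
    """Count classifier calls needed when every window shape is tried."""
    return sum(
        1
        for win_h, win_w in window_shapes
        for _ in sliding_windows(image_h, image_w, win_h, win_w, stride)
    )


# Hypothetical setup: a 16 x 16 image, stride 2, two window shapes.
one_shape = count_windows(16, 16, [(14, 14)], stride=2)
two_shapes = count_windows(16, 16, [(14, 14), (8, 8)], stride=2)
print(one_shape, two_shapes)
```

Every extra window shape adds its own full scan, so the number of classifier calls grows quickly, and neighbouring windows share most of their pixels.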

Question 1: so are we using the windows themselves as the bounding boxes (when we detect an object)?

If so, no wonder this algorithm is inaccurate: it does not calculate (bx, by, bw, bh).

Question 2: how do we choose the window shapes and sizes? Besides trying them all, consider the case where a smaller window captures an object: a larger one around the same position will capture it again. How do we choose between them for the bounding box?

Now we come to the idea of the convolutional version of sliding window. We can save the computation results for each filter position and, at the end, trace back which window each output came from! To do that, let's first pick a window size and shape (the lecture uses 14 * 14) and train a CNN whose input has the same shape as the desired window. After training, feed the original test image, which is larger (16 * 16 in the lecture), into the trained CNN. Since the input is now larger than what the network was trained with, the output naturally becomes larger as well: it is no longer 1 * 1 but 2 * 2 in the lecture's example (again ignoring the class dimension), and each value corresponds to one window position, telling us yes or no for an object in that window.
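Here is a small NumPy sketch of that idea, under assumptions of my own: a toy network of a 5 x 5 conv, a 2 x 2 max pool, a second 5 x 5 conv standing in for the FC layer, and a 1 x 1 conv, with random weights and small channel counts. It is not the lecture's exact network. It checks that running the network once on a 16 x 16 image gives the same scores as running it separately on each 14 x 14 window taken with stride 2.

```python
import numpy as np


def conv2d_valid(x, w):
    """Stride-1, no-padding conv. x: (H, W, C_in), w: (k, k, C_in, C_out)."""
    k = w.shape[0]
    out_h, out_w = x.shape[0] - k + 1, x.shape[1] - k + 1
    out = np.zeros((out_h, out_w, w.shape[3]))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i:i + k, j:j + k, :]
            out[i, j] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2]))
    return out


def max_pool_2x2(x):
    """Non-overlapping 2 x 2 max pooling; assumes even height and width."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))


def net(x, w1, w2, w3):
    a = np.maximum(conv2d_valid(x, w1), 0)  # 5 x 5 conv + ReLU
    a = max_pool_2x2(a)                     # this pool is what gives stride 2
    a = np.maximum(conv2d_valid(a, w2), 0)  # "FC" layer written as a 5 x 5 conv
    return conv2d_valid(a, w3)              # 1 x 1 conv: one score per window


rng = np.random.default_rng(0)
w1 = rng.standard_normal((5, 5, 3, 4))
w2 = rng.standard_normal((5, 5, 4, 8))
w3 = rng.standard_normal((1, 1, 8, 1))

image = rng.standard_normal((16, 16, 3))
full = net(image, w1, w2, w3)  # one pass over the whole image

# Compare each output position with a separate run on the matching 14 x 14 crop.
matches = all(
    np.allclose(
        full[i, j],
        net(image[2 * i:2 * i + 14, 2 * j:2 * j + 14], w1, w2, w3)[0, 0],
    )
    for i in range(2)
    for j in range(2)
)
print(full.shape, matches)
```

With these assumed layers, a 14 x 14 input gives a 1 x 1 output and a 16 x 16 input gives a 2 x 2 grid, one entry per window position.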

Question 3: why does Andrew introduce the fact that an FC layer is equivalent to a CONV layer whose filter has the same size as the previous layer's output? Yes, that is true. But it does not seem to simplify anything. Sequentially or convolutionally, both can use FC or CONV.
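For reference, the equivalence itself can be checked numerically. This is a minimal sketch with shapes I picked for illustration (a 5 x 5 x 16 activation and 400 units) and random weights. It reshapes the FC weight matrix into 400 filters of size 5 x 5 x 16, then slides those same filters over a slightly larger 6 x 6 x 16 activation.

```python
import numpy as np

rng = np.random.default_rng(0)
act = rng.standard_normal((5, 5, 16))          # previous layer's output
w_fc = rng.standard_normal((5 * 5 * 16, 400))  # FC weights: 400 inputs -> 400 units

# FC layer: flatten, then matrix multiply.
fc_out = act.reshape(-1) @ w_fc

# Same weights viewed as 400 conv filters, each 5 x 5 x 16.
w_conv = w_fc.reshape(5, 5, 16, 400)
conv_out = np.tensordot(act, w_conv, axes=([0, 1, 2], [0, 1, 2]))
same = np.allclose(fc_out, conv_out)

# The conv form can also be slid over a larger 6 x 6 x 16 activation.
big = rng.standard_normal((6, 6, 16))
slid = np.stack([
    np.stack([
        np.tensordot(big[i:i + 5, j:j + 5], w_conv, axes=([0, 1, 2], [0, 1, 2]))
        for j in range(2)
    ])
    for i in range(2)
])  # shape (2, 2, 400)

# Each position equals the FC layer applied to the matching 5 x 5 crop.
per_crop = np.allclose(slid[1, 0], big[1:6, 0:5].reshape(-1) @ w_fc)
print(same, slid.shape, per_crop)
```

The FC form needs exactly 400 inputs, while the reshaped filters accept any activation that is at least 5 x 5.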

Question 4: the convolutional implementation of sliding window still does not tell us how to choose the window size or aspect ratio. It looks like we still need to try them ALL! What is even worse, do we need to train a new CNN for EACH window size/ratio choice? That can be a lot of work!

The sliding window algorithm does not compute accurate bounding boxes. That's where YOLO comes in. YOLO gets rid of the windows and actually computes (bx, by, bw, bh) for each box. Another good thing is that we no longer need to try all window shapes!

But the "window" idea does one extra good thing: a window makes sure there is at most one object within it. When there are multiple objects in the image, YOLO does not know in advance how many boxes to compute. That's why YOLO needs to "divide" the image into smaller grid cells, so that there is at most one object per cell.

In YOLO, we "imagine" first dividing the original image into smaller grid cells and then applying the convolutional sliding window algorithm to it. But we don't actually do the "dividing"; instead, we just need to design a specific CNN architecture. For example, the input is 608 * 608 (ignoring RGB), and we want to divide it into 19 * 19 non-overlapping cells. 608 / 19 = 32, which means we need to design a CNN that shrinks the input by a factor of 32, with the "window stride" equal to 32 pixels.
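The arithmetic can be sketched as follows. I am assuming the factor of 32 comes from five downsampling stages that each halve the width and height (2^5 = 32); that is one common way to get it, and I am not claiming it is the exact YOLO architecture. The helper names are hypothetical.

```python
def output_size(input_size, num_halvings):
    """Spatial size after repeatedly halving (assumes 'same' padding elsewhere)."""
    size = input_size
    for _ in range(num_halvings):
        size //= 2
    return size


def cell_pixel_range(cell_index, cell_size):
    """Pixel interval [start, end) covered by one grid cell along one axis."""
    start = cell_index * cell_size
    return start, start + cell_size


grid = output_size(608, num_halvings=5)  # 608 -> 304 -> 152 -> 76 -> 38 -> 19
cell_size = 608 // grid                  # pixels per cell along each axis
first_cell = cell_pixel_range(0, cell_size)
last_cell = cell_pixel_range(grid - 1, cell_size)
print(grid, cell_size, first_cell, last_cell)
```

Neighbouring cells start 32 pixels apart and each is 32 pixels wide, so the cells tile the image without overlapping.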

Question 5: this design doesn't sound easy. Taking the sliding window example, 16 * 16 → 2 * 2, the window stride is 2, not 16 / 2 = 8! Is that true?

Question 6: I want to make sure I understand the usefulness of this "dividing" scheme. In other words, what's wrong if we treat the whole image as one big cell? My understanding is that "dividing" helps us avoid the case where multiple objects fall in one cell, so each cell only needs to compute one (bx, by, bw, bh). To make this well defined, YOLO assigns each object to the cell that contains its center.
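Here is how I picture the center rule, as a minimal sketch. The function name and the example box are made up. I am also assuming one common convention: (bx, by) is the center's offset inside its cell, in the range [0, 1), and (bw, bh) are measured in units of the cell size, so they can be larger than 1.

```python
def assign_to_cell(center_x, center_y, box_w, box_h, image_size=608, grid=19):
    """Return the owning cell (row, col) and the box encoded relative to it."""
    cell_size = image_size / grid        # 32 pixels in this example
    col = int(center_x // cell_size)
    row = int(center_y // cell_size)
    bx = center_x / cell_size - col      # offset inside the cell, 0 <= bx < 1
    by = center_y / cell_size - row
    bw = box_w / cell_size               # in cell units, may exceed 1
    bh = box_h / cell_size
    return (row, col), (bx, by, bw, bh)


# Hypothetical object: center at pixel (300, 200), 100 pixels wide, 50 tall.
cell, box = assign_to_cell(300, 200, 100, 50)
print(cell, box)
```

Only the cell containing the center is responsible for the object, even though the box itself spans several cells.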

Question 7: now finally, how does YOLO learn to predict the bounding box (bx, by, bw, bh)? Labeled (bx, by, bw, bh) values and a loss function, is that all?