Does Each Output Box of YOLO Ignore Some of the Image?

Here’s my 2 cents. If you haven’t spent time with the original YOLO paper, it is worth reading, even though it glosses over some important details and some things have changed in subsequent versions since 2015. You can find it in several places on the web; here is one: https://homes.cs.washington.edu/~ali/papers/YOLO.pdf

Below are some important ideas from that paper that might help differentiate YOLO from sliding windows and the convolutional implementation of sliding windows covered in that section of the lecture.

“YOLO reasons globally about the image when making predictions. Unlike sliding window and region proposal-based techniques, YOLO sees the entire image during training and test time so it implicitly encodes contextual information about classes as well as their appearance.” (From Section 1.)

“Our network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously. This means our network reasons globally about the full image and all the objects in the image.” (From Section 2.)

Notice in the architecture diagram (Figure 3 of the paper) that the neural network input is the full input image, 448 × 448, and not the grid cell shape or size. Information from the entire image propagates through to the fully connected layers at the end.

The grid cell count and size are related to the convolutional downsampling, but they are not exactly the same thing. The convolutional layers take the entire image and produce a feature map downsampled 64x in the original paper (448 × 448 input down to a 7 × 7 map); later versions downsample 32x (v2), and then predict at 32x, 16x, and 8x scales (v3). In v1 the resulting feature map is fed to 2 fully connected layers, and the dimension of the last fully connected layer is the output of the overall network. It maps to a multidimensional structure based on the number of grid cells, boxes per cell (anchor boxes, from v2 on), and classes: S × S × (B ∗ 5 + C) in v1 notation. The layout is slightly different for v2 and for the exercise in this course.
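As a concrete check of that arithmetic, here is a minimal sketch using the v1 PASCAL VOC configuration from the paper (this is illustrative Python, not code from the paper or the course):

```python
# v1 configuration for PASCAL VOC, per the original paper.
S = 7     # grid cells per side (448 / 64 = 7)
B = 2     # bounding boxes predicted per grid cell
C = 20    # PASCAL VOC classes

# The conv stack maps the 448 x 448 x 3 image to a 7 x 7 x 1024 feature
# map; the last fully connected layer then emits one flat vector that is
# reshaped into the S x S x (B*5 + C) prediction tensor.
output_len = S * S * (B * 5 + C)      # 7 * 7 * 30 = 1470
output_shape = (S, S, B * 5 + C)      # (7, 7, 30)
print(output_len, output_shape)
```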

For v1, the bounding box width and height are normalized by the image width and height so that they fall between 0 and 1, which is whole-image-relative. Thus a predicted bounding box with width = 1.0 and height = 1.0 would be full image size (not constrained to grid cell size or shape). Here is the relevant excerpt from the original paper:

“Each bounding box consists of 5 predictions: x, y, w, h, and confidence. The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell. *The width and height are predicted relative to the whole image.*” (Section 2, emphasis added.)
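To make that concrete, here is a hedged sketch of decoding one v1 box into pixel coordinates (decode_v1_box is a hypothetical helper, not from the paper; it just applies the normalizations quoted above):

```python
def decode_v1_box(x, y, w, h, row, col, S=7, img_size=448):
    """Hypothetical helper: convert one v1 prediction to pixels,
    assuming (x, y) are cell-relative and (w, h) image-relative."""
    cell = img_size / S              # pixels per grid cell (64 here)
    center_x = (col + x) * cell      # cell offset -> image pixels
    center_y = (row + y) * cell
    box_w = w * img_size             # image-relative -> pixels
    box_h = h * img_size
    return center_x, center_y, box_w, box_h

# A box with w = h = 1.0 spans the full 448 x 448 image, no matter
# which 64 x 64 grid cell predicted it.
print(decode_v1_box(0.5, 0.5, 1.0, 1.0, row=3, col=3))
# -> (224.0, 224.0, 448.0, 448.0)
```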

The mechanism changes a bit for the later YOLO versions, but a bounding box is still able to exceed the grid cell size, which is not the case for sliding windows. There are other threads in this forum that discuss the v2 mechanism in detail; a minimal sketch of its decode equations follows.
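For reference, the decode equations below are from the YOLO9000 (v2) paper; the helper itself is hypothetical, and the anchor priors (p_w, p_h) are expressed in grid-cell units:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

# v2 equations from the YOLO9000 paper:
#   b_x = sigmoid(t_x) + c_x      b_w = p_w * exp(t_w)
#   b_y = sigmoid(t_y) + c_y      b_h = p_h * exp(t_h)
# (c_x, c_y) is the cell's top-left corner and (p_w, p_h) an anchor
# prior, all in grid-cell units.
def decode_v2_box(tx, ty, tw, th, cx, cy, pw, ph):
    bx = sigmoid(tx) + cx          # center is confined to its own cell...
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)         # ...but exp() puts no upper bound on
    bh = ph * math.exp(th)         # width/height, so boxes can span cells
    return bx, by, bw, bh

# Example: with a 1 x 2 (w x h) anchor and t_w = t_h = 1, the decoded box
# is about 2.7 x 5.4 grid cells -- well beyond a single cell.
print(decode_v2_box(0.0, 0.0, 1.0, 1.0, cx=3, cy=3, pw=1.0, ph=2.0))
```

Hope this helps.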