Does Each Output Box of YOLO Ignore Some of the Image?

Hi! I’ve been wondering about the YOLO algorithm. It seems like each output only ever gets information from a portion of the input image.

It appears that the top-left output, which represents the top-left square of the grid, only ever sees the top-left 14x14 portion of the image. Because there’s no fully connected layer, that output only sees computation done on that 14x14 part of the image and never looks at the 2x2 regions elsewhere. Does each square of the YOLO grid ignore a few pixels of the original image, and if so, does that cause any problems? If you have a 1000x1000 input image of a car that takes up the whole image and you use a 100x100 grid, would the algorithm fail because each grid cell ignores too many pixels of the original image?


Do the images in this topic answer your question on a bounding box covering the full object?

Here’s my 2 cents. If you haven’t spent time with the original YOLO paper, you might want to, even though it glosses over some important details and some things have changed in subsequent versions since 2015. You can find it in several places on the web.

Below are some important ideas from that paper that might help differentiate YOLO from sliding windows and the convolutional implementation of sliding windows covered in that section of the lecture.

YOLO reasons globally about the image when making predictions. Unlike sliding window and region proposal-based techniques, YOLO sees the entire image during training and test time so it implicitly encodes contextual information about classes as well as their appearance. From Section 1.

Our network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously. This means our network reasons globally about the full image and all the objects in the image. From Section 2.

Notice on the architecture depiction below that the neural network input is the full input image size, 448x448, and not the grid cell shape or size. Information from the entire image is propagated through to the fully connected layers at the end.

The grid cell count and size are related to the convolutional downsampling, but they are not exactly the same thing. The convolutional layers take the entire image and produce a downsampled feature map; later versions downsample 32x, with v3 adding predictions at 16x and 8x scales as well. In v1 the resulting feature map is then fed to 2 fully connected layers, and the dimension of the last fully connected layer is the output of the overall network. It maps to a multidimensional structure based on the number of grid cells, boxes, and classes: S × S × (B ∗ 5 + C) per v1. It is slightly different for v2 and the exercise in this course.
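To make the arithmetic concrete, here is a small sketch (not the actual YOLO code) showing how the v1 output size falls out of the grid, box, and class counts from the paper (S = 7, B = 2, C = 20), and how six 2x spatial reductions take the 448-pixel input down to the 7x7 grid:

```python
# v1 values from the paper: grid cells per side, boxes per cell, classes
S, B, C = 7, 2, 20

# Length of the flattened output vector: S x S x (B * 5 + C)
output_len = S * S * (B * 5 + C)
print(output_len)  # 7 * 7 * 30 = 1470

# Six 2x spatial reductions take the 448x448 input down to a 7x7 map
size = 448
for _ in range(6):
    size //= 2
print(size)  # 7
```

So the 7x7 grid is a property of the output tensor's shape, not a decomposition of the input into separate crops.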

For v1, bounding box shapes are constrained by their activation function to be between 0 and 1, relative to the whole image. Thus a predicted bounding box with width = 1.0 and height = 1.0 would be full image size (not constrained to the grid cell’s size or shape). Here is the relevant excerpt from the original paper:

Each bounding box consists of 5 predictions: x, y, w, h, and confidence. The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell. The width and height are predicted relative to the whole image. Section 2.

(emphasis added)
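Here is a hedged sketch of decoding one v1 box prediction into absolute image coordinates under the quoted convention: (x, y) are offsets within the responsible grid cell, while (w, h) are fractions of the whole image. The function and parameter names (`decode_box`, `row`, `col`) are mine, not from the paper:

```python
def decode_box(x, y, w, h, row, col, S=7, img_w=448, img_h=448):
    """Convert a v1-style prediction to absolute image coordinates.

    x, y: center offsets in [0, 1] within grid cell (row, col)
    w, h: box size in [0, 1] relative to the WHOLE image
    """
    cell_w, cell_h = img_w / S, img_h / S
    center_x = (col + x) * cell_w
    center_y = (row + y) * cell_h
    box_w = w * img_w   # whole-image-relative, so w = 1.0 spans the image
    box_h = h * img_h
    return center_x, center_y, box_w, box_h

# A box with w = h = 1.0 spans the entire image, regardless of which
# grid cell predicted it:
print(decode_box(0.5, 0.5, 1.0, 1.0, row=0, col=0))
```

This is why a single grid cell can own a box far larger than the cell itself: only the box center is tied to the cell.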

The mechanism changes a bit for the later YOLO versions, but a bounding box is still able to exceed grid cell size, which is not the case for sliding windows. There are other threads in this forum that discuss the v2 mechanism in detail. Hope this helps.

Thank you for your response! I know that the full image is the input and that the YOLO algorithm isn’t just an implementation of sliding windows. But since each cell of a CNN layer corresponds to only an f x f portion of the previous layer, I was wondering whether each grid cell would only ever end up seeing, say, 95% of the original image and ignore a couple of pixels on the sides. Is this true, and if so, are there any repercussions, or did it change in later YOLO versions?


Sorry that I misunderstood the question. Most people ask how YOLO can detect an object larger than a grid cell, assuming the image is decomposed as it is in sliding windows.

Intentional selection of input and filter sizes, layer depth, and convolution stride and padding avoids data loss in a YOLO CNN. The architecture exactly transforms the entire input into the properly sized output: with same-style padding, no border pixel is dropped, and in v1 the fully connected layers mix information from all grid positions anyway.
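To illustrate the receptive-field side of the question, here is a rough sketch of the standard recurrence for the theoretical receptive field of one output unit through a stack of conv/pool layers. The layer list below (alternating 3x3 stride-1 convs and 2x2 stride-2 reductions) is illustrative only, not the exact YOLO backbone:

```python
def receptive_field(layers):
    """Theoretical receptive field of one output unit.

    layers: list of (kernel_size, stride) pairs, input to output.
    Uses the standard recurrence: r += (k - 1) * jump; jump *= stride.
    """
    r, jump = 1, 1
    for k, s in layers:
        r += (k - 1) * jump
        jump *= s
    return r

# Six repetitions of a 3x3 conv followed by a 2x2 stride-2 reduction
layers = [(3, 1), (2, 2)] * 6
print(receptive_field(layers))  # 190: each final unit sees a 190x190 patch
```

Even in this modest illustrative stack, one unit of the final map sees a patch far larger than one grid cell. Each unit's patch overlaps heavily with its neighbors', so no pixel is "owned" by one cell and ignored by the rest, and in v1 the fully connected layers then combine all positions.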