Why do we need grids in the YOLO algorithm when the bounding box can go outside the grid?

There are several distinct points here. First, every neural net learns in essentially the same way, regardless of what it is predicting. Second, every neural net produces the same kind of output, numbers, regardless of what those numbers represent to humans. Third, a grid is not needed, or particularly helpful, for predicting a single bounding box in an image, but it is one approach to predicting multiple bounding boxes in an image.

Take the first point first. Neural network learning is driven by computation of error between a known value, the label, and a predicted value. Make a prediction, compare to the correct value, adjust parameters in a direction that hopefully will reduce that error, repeat. Details vary on the mechanism for computing the error and how to manage changes to the parameters, but the overall approach is the same.
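To make that loop concrete, here is a minimal sketch in plain Python. It assumes the simplest possible 'network' (one weight `w` and one bias `b`), a squared error, and a made-up toy dataset; the function name `train` and the learning rate are my own choices for illustration, not anything specific to YOLO.

```python
def train(samples, lr=0.05, epochs=500):
    """Fit y ~ w*x + b by repeatedly predicting, measuring error, and adjusting."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, label in samples:
            pred = w * x + b      # make a prediction
            err = pred - label    # compare to the correct value (the label)
            w -= lr * err * x     # adjust each parameter in the direction
            b -= lr * err         # that reduces the squared error
    return w, b


# Toy labels that follow y = 2x + 1 (an assumption made for this example).
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = train(samples)
print(round(w, 3), round(b, 3))
```

Real networks have many more parameters and compute the adjustments with backpropagation, but the predict, compare, adjust, repeat cycle is the same.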

Second, that process is followed regardless of what the numbers represent. That means bounding box center and shape are ‘learned’ exactly the same way a cat/non-cat classification is. We tell the neural net what the ‘correct’ output values are, and it tries to reproduce them. The neural net doesn’t know or care that in some cases we interpret a 1 as confidence that the image contains a cat and in other cases as the grid offset of an object center.
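A small sketch of the same idea: one error function applied to two outputs that mean completely different things to us. The labels and predictions are invented for illustration, and using plain squared error for both is a simplification, since real detectors usually combine several differently weighted loss terms.

```python
def squared_error(pred, label):
    """Sum of squared differences; it has no idea what the numbers mean."""
    return sum((p - t) ** 2 for p, t in zip(pred, label))


# Human interpretation: 1.0 means "this image contains a cat".
cat_label = [1.0]
cat_pred = [0.5]

# Human interpretation: (center x offset, center y offset, width, height).
box_label = [1.0, 0.5, 0.25, 0.5]
box_pred = [0.5, 0.5, 0.25, 0.5]

cat_loss = squared_error(cat_pred, cat_label)
box_loss = squared_error(box_pred, box_label)
print(cat_loss, box_loss)
```

Both calls go through exactly the same arithmetic. The only difference is how we read the numbers afterwards.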

Finally, think about your proposal a little more deeply. Sure, a network can be trained to predict 4 values from a picture with a single object. But which 4 values would it predict for this image?

[Image: Screen Shot 2021-09-19 at 9.51.43 AM]

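YOLO uses the grid to make multiple sets of 4-value predictions concurrently, so that it can handle images with multiple objects. The grid cell only determines which set of predictions is responsible for an object (the cell containing the object's center); the predicted box itself is free to extend beyond that cell.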
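Here is a rough sketch of how labels for a multi-object image might be laid out on a grid. It is heavily simplified and rests on assumptions of mine: a 3x3 grid, one box per cell, no anchor boxes, no class scores, and box coordinates normalized to the image. The helper name `encode` is hypothetical, not taken from any YOLO codebase.

```python
S = 3  # hypothetical grid size; real YOLO versions use larger grids


def encode(boxes, s=S):
    """Place each (cx, cy, w, h) box, in image-normalized coords, into a grid cell.

    Each cell holds [objectness, x offset, y offset, w, h]; empty cells stay zero.
    """
    grid = [[[0.0] * 5 for _ in range(s)] for _ in range(s)]
    for cx, cy, w, h in boxes:
        col, row = int(cx * s), int(cy * s)          # cell containing the center
        x_off, y_off = cx * s - col, cy * s - row    # center position inside that cell
        grid[row][col] = [1.0, x_off, y_off, w, h]   # w, h relative to the whole image
    return grid


# Two objects in one image. The first is 0.9 of the image wide, far wider
# than a single cell (1/3 of the image), yet it is still owned by one cell.
boxes = [(0.25, 0.25, 0.9, 0.4), (0.75, 0.5, 0.2, 0.6)]
grid = encode(boxes)
print(grid[0][0])
print(grid[1][2])
```

Each object lands in its own cell, so the network has a separate slot of outputs for each one. The width and height stored in a cell are not limited by the cell's size, which is why a box can extend outside its cell.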

This thread has a longer explanation of grid cells and anchor boxes in YOLO. Let us know if it helps.
