A clarification about Image Classification and Localization Algorithm and YOLO


How can the “Image Classification and Localization” algorithm localize the boundaries of any of these cars when part of their “actual” boundary lies outside the grid cell (which the algorithm shouldn’t be able to see)?


YOLO is by far the most complicated system we’ve seen so far, so it’s no wonder that it takes some serious head-scratching to understand. The point is not that the algorithm can’t see things outside of the current grid cell: the grid cells are just used to organize the computation. A given object is reported only by the grid cell that contains its centroid, but there is no requirement that the object’s bounding box lie completely within that grid cell. The bounding box “is what it is”. Over the next couple of lectures and in the assignment, you’ll also see how they deal with the fact that the same object can be reported multiple times with slightly different bounding boxes (non-max suppression). In all this, Prof. Ng doesn’t really say much about how all this complexity gets trained, but it’s a safe bet that “it’s complicated”. :scream_cat:
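To make the “centroid decides the cell” idea concrete, here is a minimal sketch (not the course code; the image size, grid size, and box coordinates are made-up examples) of how an object gets assigned to a grid cell even when its bounding box spans several cells:

```python
def assign_to_cell(box, img_w, img_h, S=3):
    """box = (x_min, y_min, x_max, y_max) in pixels.
    Returns (row, col) of the S x S grid cell containing the box centroid."""
    cx = (box[0] + box[2]) / 2.0   # centroid x
    cy = (box[1] + box[3]) / 2.0   # centroid y
    col = int(cx / img_w * S)
    row = int(cy / img_h * S)
    return row, col

# A car whose box spans several 100x100 cells of a 300x300 image:
car = (50, 120, 260, 280)             # much wider than any one grid cell
print(assign_to_cell(car, 300, 300))  # centroid (155, 200) -> cell (2, 1)
```

Only cell (2, 1) “owns” this car and reports it, even though the box overlaps five other cells.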

If you have more detailed questions about any of this and want to go deeper, there are some great threads from fellow student ai_curious, who has done some serious work using and studying YOLO and then writing about it. Here’s a good one to start on, and this one is more specific to the question of multiple bounding boxes.

And the fact that YOLO can handle bounding boxes larger than a grid cell is a key differentiator and advantage over sliding windows. The sliding windows approach iteratively chops up the image and can lose parts of objects that are larger than the window or near its edge. YOLO processes the entire input image all at once: each grid cell predicts whether an object center lies within it, along with the center location and the bounding box shape. It is perfectly fine for the bounding box dimensions to exceed the grid cell size.
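One way to see why the box can outgrow its cell is to look at how YOLO-style predictions get decoded. This sketch follows the YOLOv2-style parameterization (sigmoid for the center, exponential for the shape; anchor-box scaling omitted for brevity), which is an assumption on my part rather than exactly what the lecture shows. The center is squashed to stay inside the owning cell, but the width and height are unbounded:

```python
import math

def decode_box(tx, ty, tw, th, row, col, S=3):
    """Hedged sketch of YOLO-style box decoding (anchor terms omitted).
    The center is constrained to lie inside cell (row, col); the width and
    height come from exp(t) in cell units, so they may exceed one cell."""
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    bx = (col + sigmoid(tx)) / S   # center x, as a fraction of image width
    by = (row + sigmoid(ty)) / S   # center y, as a fraction of image height
    bw = math.exp(tw) / S          # width in image fractions; > 1/S is fine
    bh = math.exp(th) / S          # height in image fractions
    return bx, by, bw, bh

bx, by, bw, bh = decode_box(0.0, 0.0, tw=1.5, th=1.5, row=1, col=1)
# bw = bh = exp(1.5)/3, i.e. about 4.5 grid cells on a side: the center
# sits in the middle cell, but the box covers nearly the whole image.
```

Nothing in the decoding ties the box shape to the cell boundaries; the cell only anchors the center.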