Doubt regarding YOLO class prediction and anchor boxes

I cannot completely understand the intuition behind the anchor boxes in YOLO. According to the papers, each grid cell is responsible for predicting the class of the object inside it. My first question comes from here: since the grid is 19x19, won’t each cell contain too little information to understand the whole context of the object (especially in the case of large objects)? Also, since each cell predicts the class, how does having different anchor boxes help? A box will span multiple grid cells, but the prediction is made based on the content of one particular grid cell (which always lies completely inside the anchor box). This could only make sense if the whole content inside the anchor box is taken into consideration to predict the class. Is that so?

Yes, that sounds right. A couple of key points to make are that a) bounding boxes and anchor boxes are related but they are not the same thing, and b) bounding boxes do not have to be contained within a single grid cell. The grid cells are just used to organize the output by assigning each object to the cell that contains its centroid.
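To make the centroid assignment concrete, here is a minimal sketch in plain Python. The 19x19 grid comes from the question; the 608x608 input size, the pixel box format, and the helper names `assign_to_cell` and `cells_covered` are assumptions made up for illustration, not part of any YOLO library.

```python
def assign_to_cell(box, image_size=608, grid_size=19):
    """Return (row, col) of the grid cell that contains the box centroid.

    box is (x_min, y_min, x_max, y_max) in pixels.
    image_size=608 is an assumption (19 cells of 32 px each).
    """
    x_min, y_min, x_max, y_max = box
    cx = (x_min + x_max) / 2
    cy = (y_min + y_max) / 2
    cell = image_size / grid_size
    # Clamp so a centroid on the far edge still maps to the last cell.
    col = min(int(cx // cell), grid_size - 1)
    row = min(int(cy // cell), grid_size - 1)
    return row, col


def cells_covered(box, image_size=608, grid_size=19):
    """Count how many grid cells the box overlaps (hypothetical helper)."""
    x_min, y_min, x_max, y_max = box
    cell = image_size / grid_size
    first_col = int(x_min // cell)
    last_col = min(int(x_max // cell), grid_size - 1)
    first_row = int(y_min // cell)
    last_row = min(int(y_max // cell), grid_size - 1)
    return (last_row - first_row + 1) * (last_col - first_col + 1)


# A large object: 400 px wide and 360 px tall.
big_box = (100, 200, 500, 560)
owner = assign_to_cell(big_box)
covered = cells_covered(big_box)
print(owner, covered)
```

The box overlaps many cells, but only the one holding its centroid is the slot in the output where that object is recorded.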

This is a pretty deep topic. YOLO is by far the most complex algorithm we’ve encountered in any of the DLS courses so far. There are some great threads explaining YOLO in more detail that are worth a look. Here’s one to get started on the role of anchor boxes.

I would say this is partly correct. It is a common misconception that the input image is divided into grid cells. It is not. The grid is the shape of the network's output: each output position is computed by convolutional layers that see a region of the image much larger than one cell, so a cell is not limited to the pixels directly under it. Each grid cell is responsible for localizing (where is it) and classifying (what is it) objects that are predicted to be centered inside it. But there is no requirement that the predicted bounding box be contained entirely within one grid cell. Maybe take a read through the several threads that have discussed these concepts in detail and let us know what you learn or what questions remain. Cheers
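As a rough illustration of why a predicted box is not confined to its cell, the sketch below decodes one prediction using the YOLOv2-style parameterization, where the center is an offset inside the cell and the size is a scaled anchor. The function name `decode_box`, the raw outputs, and the anchor size are made-up values for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def decode_box(t, cell, anchor, grid_size=19):
    """Decode raw outputs into a box in image-relative units (0..1).

    t      : (tx, ty, tw, th) raw network outputs for one anchor
    cell   : (row, col) of the grid cell making the prediction
    anchor : (pw, ph) anchor width and height in grid-cell units (assumed)
    """
    tx, ty, tw, th = t
    row, col = cell
    pw, ph = anchor
    # The sigmoid keeps the center inside the predicting cell.
    bx = (col + sigmoid(tx)) / grid_size
    by = (row + sigmoid(ty)) / grid_size
    # Width and height scale the anchor and can span many cells.
    bw = pw * math.exp(tw) / grid_size
    bh = ph * math.exp(th) / grid_size
    return bx, by, bw, bh


# Hypothetical prediction from cell (row 11, col 9) with a wide anchor.
bx, by, bw, bh = decode_box((0.0, 0.0, 0.0, 0.0), cell=(11, 9), anchor=(10.0, 6.0))
print(bx, by, bw * 19, bh * 19)  # center position, then size in cell units
```

The center stays inside cell (11, 9), while the width and height are about 10 and 6 cells, so the anchor mainly sets the starting shape that the network refines.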
