What are anchor boxes doing? week 3, assignment 1

In one of the week 3 lectures, Andrew explained anchor boxes. My impression is that the main reason for using anchor boxes is to let the algorithm detect more than one object in each grid cell. But in the programming assignment, the output values of the yolo_filter_boxes function are not separated by anchor box, and when yolo_non_max_suppression is applied to them, the boxes are treated as if there were no difference between the 5 anchor boxes. So, for example, if there are both a pedestrian and a car in a specific grid cell, only one of them will finally be selected as a prediction. I would appreciate it if you could correct me if I’m wrong.


I think what you are missing is that this is all about multiple boxes being returned. There is no constraint that says there is only one per grid cell. The only constraint is that we use the threshold value to screen out hits that look weaker, thus ending up with the highest confidence predictions. But it’s still the case that there can be more than one per grid cell, up to one for each anchor box within that grid cell.
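To make the threshold step concrete, here is a minimal NumPy sketch. It is not the assignment's yolo_filter_boxes; it assumes the score of a box is its presence confidence times its best class probability, and the cell index (7, 7), the anchor indices, the class indices, and the 0.6 threshold are all made-up values for illustration.

```python
import numpy as np

# Assumed sizes: 19x19 grid, 5 anchor boxes, 80 classes.
GRID, ANCHORS, CLASSES = 19, 5, 80
box_confidence = np.zeros((GRID, GRID, ANCHORS, 1))
box_class_probs = np.zeros((GRID, GRID, ANCHORS, CLASSES))

# Hypothetical detections: two anchors in the SAME grid cell (7, 7)
# are confident about two different classes.
box_confidence[7, 7, 0, 0] = 0.9
box_class_probs[7, 7, 0, 0] = 0.95   # anchor 0 -> class 0
box_confidence[7, 7, 3, 0] = 0.8
box_class_probs[7, 7, 3, 2] = 0.9    # anchor 3 -> class 2

# Score every class for every (row, col, anchor) location.
box_scores = box_confidence * box_class_probs      # shape (19, 19, 5, 80)
best_class = box_scores.argmax(axis=-1)            # shape (19, 19, 5)
best_score = box_scores.max(axis=-1)               # shape (19, 19, 5)

# Keep only locations whose best score clears the (assumed) threshold.
mask = best_score >= 0.6
kept_scores = best_score[mask]        # flat list: grid/anchor layout is gone
kept_classes = best_class[mask]
kept_locations = np.argwhere(mask)    # only for inspection: (row, col, anchor)

print(kept_locations)
print(kept_classes)
```

The boolean mask flattens the result, so the kept scores no longer carry their grid cell or anchor index, yet both boxes from the same cell survive the filter.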

YOLO is by far the most complex network that we’ve seen so far. There is a lot to consider here in order to really “grok” what is going on. I make no claim to really understand it yet. One good thing to do if you want to go deeper is to look at some of the explanatory posts about YOLO from @Ai_curious here on Discourse. Here’s a good one to start with. And here’s another to follow up from that.

Each SxSxB location in the network output can contain only one prediction vector of length 1+(2+2)+C: one presence confidence p_c, one bounding box center location (b_x, b_y), one bounding box shape (b_w, b_h), and one class prediction vector c_i of length C.

The YOLO network outputs one vector of predictions for each output location on each complete forward propagation. An output location is the tuple (grid cell width offset, grid cell height offset, anchor box), so a 19x19 grid with 5 anchor boxes has 19x19x5 = 1,805 locations. Each location makes a prediction on presence, center, shape, and class, which with 80 classes comes to 1+4+80 = 85 values per location. That means 1,805 x 85 = 153,425 predicted values on each and every forward pass.
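The arithmetic above can be checked with a few lines of Python. This is only a sketch: the array is filled with random numbers as a stand-in for one image's network output, and the variable names are my own, not the assignment's.

```python
import numpy as np

# Dimensions taken from the example numbers in this post.
GRID_H, GRID_W = 19, 19
NUM_ANCHORS = 5
NUM_CLASSES = 80

# p_c + (b_x, b_y) + (b_w, b_h) + one probability per class
values_per_location = 1 + (2 + 2) + NUM_CLASSES
num_locations = GRID_H * GRID_W * NUM_ANCHORS
total_values = num_locations * values_per_location

# Stand-in for one image's output: random values, not real predictions.
output = np.random.rand(GRID_H, GRID_W, NUM_ANCHORS, values_per_location)

print(values_per_location, num_locations, total_values)
print(output.shape, output.size)
```

Every one of those values is present in the array whether or not the corresponding location actually contains an object.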

All of these numerical values exist in the output structure, even if the confidence is low or the location or shape inaccurate. The neural net just produces that 4D matrix of numbers, and leaves it to the downstream processing to figure out what they mean and whether they are significant or not. Hope this helps. There is more detail and some graphical examples in the threads @paulinpaloalto links.

It is correct that by the time NMS is run, the detail of which grid cell and anchor box made each prediction is lost. However, the inference that only one will be selected as a prediction is not correct.

Let’s assume the most confusing case, which is that the center of the person and the center of the car are in the exact same location in the image. During training, YOLO will have been told to use the taller-than-wide anchor box for clusters of pixels that look like a person, and the wider-than-tall anchor box for pixels that look like a car (remember, anchor boxes have shapes only, not locations or class types). Let’s assume that it learned well, did the same thing during prediction, and has a high confidence for both objects. Now one tuple (grid width offset, grid height offset, anchor box) contains the predicted location, shape, and class of the person, and another tuple (same grid width offset, same grid height offset, different anchor box) contains the predicted location, shape, and class of the car. The two predicted center locations will be the same, but the two predicted shapes will not be. Thus, when the two predicted bounding boxes are compared inside NMS, the IOU will be low because the shapes are dissimilar, so neither box suppresses the other and both are kept.
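Here is a small sketch of that person-and-car case in plain Python. It is a simplified greedy NMS, not the assignment's yolo_non_max_suppression, and the box coordinates, scores, and the 0.5 IOU threshold are made-up values. Boxes are (x1, y1, x2, y2) corners in normalized image coordinates.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap too much with a kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Hypothetical predictions, all centered at or near (0.5, 0.5).
person = (0.40, 0.20, 0.60, 0.80)         # taller than wide
car = (0.20, 0.40, 0.80, 0.60)            # wider than tall
person_again = (0.41, 0.20, 0.61, 0.80)   # near-duplicate of the person box

boxes = [person, car, person_again]
scores = [0.9, 0.8, 0.7]

print(iou(person, car))            # low overlap, roughly 0.2
print(iou(person, person_again))   # high overlap, roughly 0.9
print(greedy_nms(boxes, scores))   # indices of the surviving boxes
```

The person and car boxes share a center but overlap only in a small square, so both survive, while the near-duplicate person box is the one that gets suppressed.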

Thank you for your explanation and the links you shared. :pray:

That was great. I think I got it. Thanks a lot for your comprehensive and clear explanation. :pray: :rose: