Number of anchor boxes

Shouldn't the number of anchor boxes be the same as the number of classes to be detected? In the video on the YOLO algorithm, the classes are pedestrian, car, and motorcycle, but the number of anchor boxes is 2.

Anchor boxes, referred to by the YOLO creators in their papers as dimension clusters or priors, have a shape, but not a type. They can be derived from the training set by various techniques, such as K-means. Here is a forum thread about how that is accomplished ==> [Deriving YOLO anchor boxes]

Consider the implications of your suggestion. YOLO outputs 4 bounding box values, 1 object presence prediction, and a vector of class predictions for each of SxSxB detector locations. For the dimensions and data used in this class exercise, that amounts to 19x19x5x(4+1+80) = 153,425 predictions (grid size of 19, 5 anchor boxes, 80 classes). If there were 80 anchor boxes, the number of predictions would be 16x higher… 2,454,800 predictions. For ImageNet, with 1,000 classes, the number balloons to 362,805,000. The memory and computation are just not feasible.
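
If it helps to sanity-check the arithmetic, here is a quick back-of-the-envelope sketch in Python (the function name is mine; the numbers are the ones quoted above):

```python
# Total output values from the YOLO head: S x S x B x (4 box values + 1 objectness + C classes)
def yolo_output_size(grid, anchors, classes):
    return grid * grid * anchors * (4 + 1 + classes)

print(yolo_output_size(19, 5, 80))       # 153,425     - the course exercise
print(yolo_output_size(19, 80, 80))      # 2,454,800   - one anchor per class, 16x larger
print(yolo_output_size(19, 1000, 1000))  # 362,805,000 - one anchor per ImageNet class
```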

Here is another related thread that discusses the role anchor boxes play in the output shape and the values of the bounding box shape predictions [Applying YOLO anchor boxes]

Maybe take a look at those links and let us know what you think?


I have a very similar question to yours. Where do I output the results for objects that don't have an anchor box? In the examples in the videos, there are 3 classes but only 2 anchor boxes - one for a car and one for a pedestrian. It seemed that the object's position in the output vector is fixed: if the shape turns out to overlap the anchor box for a car, it's always the lower 8 numbers, and if it turns out to be a pedestrian, it's the upper 8 numbers. But what if it's a motorcycle? Do I just assign it, for example, the upper 8 numbers, knowing that the c1, c2, c3 values reveal the object type…? Thanks :).

Another question relates to bounding boxes - I'm a little perplexed about how well the algorithm can work with bounding boxes that extend beyond a given grid cell. Especially with the 19x19 grid, there may be only a small piece of the car in the grid cell containing its midpoint… When I was watching the videos, it seemed that each grid cell is looked at individually (for the object detection) - but I guess that's not true, and the network still looks at the neighboring grid cells as well?

As mentioned above, and discussed extensively in the linked threads, in YOLO anchor boxes are not type- or class-specific. There is no car anchor box and no pedestrian anchor box. Anchor box shapes are determined by exploratory data analysis on the training set, so if there are lots of training images with labelled objects like these, and you have decided to use 2 anchor boxes, you might end up with one wider-than-tall and one taller-than-wide. Labelled training objects would then be mapped to one or the other prior to training the model, while your Y matrix of labels is being set up. But the mapping is based on shape, not type.

In YOLO there are no objects that don’t have an anchor box. *

The situation during training data setup is described above. Every object in the labelled training data is mapped to the anchor box whose shape is closest to the object's ground truth bounding box shape; the comparison is done using IOU. That assignment locates the object within the ground truth matrix Y. There is no object in the training data that doesn't have an anchor box assigned.
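
If it helps to see that assignment concretely, here is a minimal sketch, assuming shape-only IOU (both boxes treated as if centered at the same point) and two made-up anchor shapes; it is not the code from the exercise:

```python
import numpy as np

def iou_wh(box_wh, anchor_wh):
    """IOU computed from widths and heights only, as if both boxes shared a center."""
    inter = min(box_wh[0], anchor_wh[0]) * min(box_wh[1], anchor_wh[1])
    union = box_wh[0] * box_wh[1] + anchor_wh[0] * anchor_wh[1] - inter
    return inter / union

# Hypothetical anchor shapes (width, height), e.g. produced by K-means on the training labels
anchors = [(0.6, 1.8), (2.4, 1.2)]   # one taller-than-wide, one wider-than-tall

gt_box_wh = (0.5, 1.5)               # a pedestrian-shaped ground truth box
best = int(np.argmax([iou_wh(gt_box_wh, a) for a in anchors]))
print(best)                          # 0 -> this object goes in the taller-than-wide slot of Y
```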

During forward propagation, the set of predictions for each grid cell + anchor box location is output by the neural net. Each location makes a prediction about an object centered at that grid location within the image: specifically, whether an object is present or not, what class it is, where the bounding box center is, and what the bounding box shape is. There is no network output cell that doesn't contain a prediction. Grid cell and anchor box are implicit because of the location in the prediction output matrix \hat{Y}. Thus there cannot be a predicted object without an anchor box.*

*The only time this assertion is broken is when there are more objects within one grid cell area of the input image than there are anchor boxes. A flock of birds, pebbles on a rocky beach, etc. If there are a large number of objects close together, YOLO might not localize all of them. But this is not related to mismatch between number of anchor boxes and known object class.

Notice that at runtime it is not the case that only the taller-than-wide anchor box makes a prediction detecting a pedestrian. The wider-than-tall anchor box can also make one simultaneously. It is just that the taller-than-wide anchor box location has been trained to detect the pedestrians in the training images and will be more effective at doing so.
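
To make "grid cell and anchor box are implicit in the location within \hat{Y}" concrete, here is a toy sketch of pulling one prediction vector out of the raw output tensor. The shapes match the course exercise; the variable names and the random stand-in tensor are mine:

```python
import numpy as np

S, B, C = 19, 5, 80
y_hat = np.random.randn(S, S, B, 4 + 1 + C)   # stand-in for the network output

row, col, anchor = 7, 11, 2                   # which grid cell and which anchor slot
pred = y_hat[row, col, anchor]                # one complete prediction vector

box_params   = pred[0:4]   # raw bounding box outputs (t_x, t_y, t_w, t_h)
objectness   = pred[4]     # raw object-presence score
class_scores = pred[5:]    # one raw score per class
```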

I encourage you to spend a little time with the linked threads, and others that explore the same or similar questions. The one about objects bigger than one grid cell, for example, has also been discussed at length elsewhere. If you still have questions, come back here and the community can help drill down further.

Some related threads…

https://community.deeplearning.ai/search?q=Anchor%20boxes


I'm sorry I am replying so late. Thank you for the answer (which I read when you wrote it, about a month ago :)). It cleared things up a lot! I read some of the threads you mentioned and those were helpful too.
I remember I wanted to ask this question but kept postponing it… but here it is. I hope it will make sense, since I finished the course a few weeks ago and some things are already a bit hazy :). But - do I understand it correctly that anchor boxes are mostly used in order to recognize more than one object within one grid cell? In other words, if every grid cell had only one object, I guess they wouldn't be needed, right?
Another question - if I remember correctly, the algorithm comes up with a shape and then compares it to the anchor boxes. Would it make sense to just compare the, say, two boxes that the algorithm finds (of a human and of a car) and have some tool to measure whether they are sufficiently different, instead of trying to compare them to anchor boxes? Wouldn't that take less time/computation? Thanks :)

This is one function they fulfill. The other is that they act as initializers for bounding box shape predictions. Not quite literally, but experiments showed that introducing anchor boxes shaped by analysis of the training data set improved the stability of the model during localization training. This happens because YOLO doesn't directly predict bounding box shapes. Instead, it predicts numbers that are applied as a scaling factor to the anchor boxes. It is detailed in one of my older threads; I will look for it and provide a link.
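
Until I dig up that thread, here is the gist, using the notation of the YOLO v2 paper: the raw network outputs t_x, t_y, t_w, t_h are combined with the grid cell offset (c_x, c_y) and the anchor box (prior) dimensions (p_w, p_h) to produce the predicted box:

  • b_x = \sigma(t_x) + c_x
  • b_y = \sigma(t_y) + c_y
  • b_w = p_w e^{t_w}
  • b_h = p_h e^{t_h}

So the anchor box shape is the starting point, and the network learns an exponential scaling of it rather than predicting a shape from scratch.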

Always risky to come right out and say X is how YOLO works, because it evolved over time. The lectures in the course sometimes gloss over which version they are talking about. The programming exercise on autonomous cars is based on V2, and for that I can say that predicted bounding boxes are never compared to anchor boxes.

When setting up the training data matrix, the ground truth bounding boxes are compared to anchor box shapes using IOU in order to assign each object to a specific location in the ground truth matrix Y.

During training, the network outputs a bounding box prediction for each grid cell + anchor box location in \hat{Y}. Each predicted bounding box is then compared to the ground truth bounding box in the corresponding location of Y. The shape of the anchor box plays no role in this comparison (which is performed inside the loss function).
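
As a deliberately oversimplified illustration of where that comparison happens (this is not the actual YOLO loss, which has several weighting and masking terms; it just shows that the anchor shape never enters the predicted-vs-ground-truth comparison):

```python
import numpy as np

def coordinate_loss(pred_boxes, true_boxes, object_mask):
    """Squared error between predicted and ground truth boxes, counted only where Y says an object exists.

    pred_boxes, true_boxes: (S, S, B, 4) arrays of box parameters
    object_mask:            (S, S, B) array, 1 where an object is present in Y
    """
    sq_err = np.sum((pred_boxes - true_boxes) ** 2, axis=-1)  # per grid cell + anchor location
    return np.sum(object_mask * sq_err)                       # anchor shapes appear nowhere here
```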

At runtime, after forward propagation, predicted bounding boxes are compared to each other, also using IOU. Predicted bounding boxes with a sufficiently high IOU are deemed duplicates, and only the predicted bounding box with the highest confidence is retained.
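
That duplicate-removal step is non-max suppression. Here is a bare-bones sketch, with an arbitrary corner-coordinate box format and threshold chosen just for illustration:

```python
def iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-confidence box, drop any remaining box that overlaps it too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```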

Summary

  • Pre-training - ground truth bounding box compared to anchor box
  • Training - ground truth bounding box compared to predicted bounding box
  • Runtime - predicted bounding box compared to other predicted bounding boxes after forward prop completes
  • Never - Anchor box compared to predicted bounding box

There are many previous threads related to anchor boxes.

This one has some discussion that overlaps significantly with this thread but might be a useful read/comparison. It includes the details of how anchor boxes (called priors in the YOLO v2 paper, hence the p in the equations) are used to compute the predicted bounding box shape from the direct network outputs (called t_w and t_h).

Hope this helps