Shouldn’t the number of anchor boxes be the same as the number of classes to be detected? In the video on the YOLO algorithm, the classes are pedestrian, car, and motorcycle, but the number of anchors is 2.
Anchor boxes, referred to by the YOLO creators in their papers as dimension clusters or priors, have a shape but not a type. They can be derived from the training set by various techniques, such as K-means. Here is a forum thread about how that is accomplished ==> [Deriving YOLO anchor boxes]
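If it helps to see the K-means idea concretely, here is a minimal sketch, assuming the ground truth boxes are available as (width, height) pairs and using 1 - IoU as the clustering distance (the approach described in the YOLOv2 paper). The function names and defaults are illustrative, not the course code:

```python
import numpy as np

def iou_wh(boxes, anchors):
    """IoU between box shapes and anchor shapes, treated as if they share a center
    (only width and height matter)."""
    inter = np.minimum(boxes[:, None, 0], anchors[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], anchors[None, :, 1])
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            anchors[None, :, 0] * anchors[None, :, 1] - inter
    return inter / union

def kmeans_anchors(boxes, k=5, iters=100, seed=0):
    """Cluster (w, h) pairs with K-means, using 1 - IoU as the distance."""
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # nearest centroid = highest IoU
        assign = np.argmax(iou_wh(boxes, anchors), axis=1)
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                        else anchors[i] for i in range(k)])
        if np.allclose(new, anchors):
            break
        anchors = new
    return anchors

# boxes: N x 2 array of ground-truth (width, height), e.g. normalized to [0, 1]
# anchors = kmeans_anchors(boxes, k=5)
```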
Consider the implications of your suggestion. YOLO outputs 4 bounding box values, 1 object presence prediction, and a vector of class predictions for each of S x S x B detector locations (an S x S grid with B anchor boxes per cell). For the dimensions and data used in this class exercise, that amounts to 19x19x5x(4+1+80) = 153,425 predictions (grid size of 19, 5 anchor boxes, 80 classes). If there were 80 anchor boxes, the number of predictions would be 16x higher… 2,454,800 predictions. For ImageNet, with 1,000 classes, the number balloons to 362,805,000. The memory and computation required are just not feasible.
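To make that arithmetic concrete, here is a tiny illustrative helper (the function name is made up for this post) that reproduces those counts:

```python
def yolo_output_count(grid=19, anchors=5, classes=80):
    # each grid cell x anchor slot predicts: 4 box values + 1 objectness + class scores
    return grid * grid * anchors * (4 + 1 + classes)

print(yolo_output_count(19, 5, 80))       # 153,425
print(yolo_output_count(19, 80, 80))      # 2,454,800   (one anchor per class)
print(yolo_output_count(19, 1000, 1000))  # 362,805,000 (ImageNet-scale classes)
```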
Here is another related thread that discusses the role anchor boxes play in the output shape and in the values of the bounding box shape predictions ==> [Applying YOLO anchor boxes]
Maybe take a look at those links and let us know what you think?
I have a very similar question to yours. Where do I output the results for objects that don’t have an anchor box? In the examples in the videos, there are 3 classes, but only 2 anchor boxes - one for a car and one for a pedestrian. It seemed that the object’s position in the output vector is fixed. If the shape turns out to overlap the anchor box for a car, it’s always the lower 8 numbers. If it turns out to be a pedestrian, it is the upper 8 numbers. But what if it’s a motorcycle? Do I just assign it, for example, the upper 8 numbers, knowing that the c1, c2, c3 values reveal the object type…? Thanks :).
Another question is related to bounding boxes - I’m a little perplexed about how well the algorithm can work with bounding boxes that extend beyond the given grid cell. Especially with the 19x19 grid, where only a small piece of the car falls in the cell containing its center point… When I was watching the videos, it seemed that each grid cell is kind of looked at individually (for the object detection) - but I guess that’s not true and the network still looks at the neighboring grid cells as well?
As mentioned above, and discussed extensively in the linked threads, in YOLO anchor boxes are not type- or class-specific. There is no car anchor box and no pedestrian anchor box. Anchor box shapes are determined by exploratory data analysis on the training set, so if there are lots of training images with labelled objects like these, and you have decided to use 2 anchor boxes, you might end up with one wider-than-tall and one taller-than-wide. Labelled training objects would then be mapped to one or the other prior to training the model, while your Y matrix of labels is being set up. But the mapping is based on shape, not type.
In YOLO there are no objects that don’t have an anchor box. *
The situation during training data setup is described above. Every object in the labelled training data is mapped to the anchor box shape that is closest to the object’s ground truth bounding box shape. This is accomplished using IoU (intersection over union). Every object in the labelled training data gets assigned to an anchor box, which locates it within the ground truth matrix Y. There is no object in the training data that doesn’t have an anchor box assigned.
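Here is a rough sketch of that assignment step, assuming ground truth boxes and anchors are expressed as (width, height) pairs normalized to the image; the anchor values below are made up purely for illustration:

```python
import numpy as np

def assign_anchor(gt_wh, anchor_whs):
    """Pick the anchor whose shape best matches a ground truth box.
    Both are (w, h) pairs; IoU is computed as if the boxes share a center."""
    inter = np.minimum(gt_wh[0], anchor_whs[:, 0]) * np.minimum(gt_wh[1], anchor_whs[:, 1])
    union = gt_wh[0] * gt_wh[1] + anchor_whs[:, 0] * anchor_whs[:, 1] - inter
    return int(np.argmax(inter / union))

# Toy anchors: one wider-than-tall, one taller-than-wide (normalized w, h)
anchors = np.array([[0.4, 0.2], [0.1, 0.3]])
print(assign_anchor(np.array([0.35, 0.15]), anchors))  # 0 -> a car-shaped box
print(assign_anchor(np.array([0.08, 0.25]), anchors))  # 1 -> a pedestrian-shaped box
```

Note that a car-shaped box lands on anchor 0 and a pedestrian-shaped box on anchor 1 only because of their shapes, not because the anchors know anything about cars or pedestrians.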
During forward propagation, the set of predictions for each grid cell + anchor box location is output by the neural net. Each location makes a prediction about an object the network believes is centered at that grid cell within the image: whether an object is present or not, what class it is, where the bounding box center is, and what the bounding box shape is. There is no network output cell that doesn’t contain a prediction. Grid cell and anchor box are implicit because of the location in the prediction output matrix \hat{Y}. Thus there cannot be a predicted object without an anchor box.*
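In code terms, a sketch of that layout, assuming the (19, 19, 5, 4+1+80) output shape from the course exercise and with random numbers standing in for real network output:

```python
import numpy as np

# Hypothetical prediction tensor shaped (grid, grid, anchors, 4 + 1 + classes)
y_hat = np.random.rand(19, 19, 5, 85)

row, col, anchor = 7, 11, 2                  # grid cell + anchor box are implied by the index
box_xywh    = y_hat[row, col, anchor, 0:4]   # predicted center (x, y) and shape (w, h)
objectness  = y_hat[row, col, anchor, 4]     # "is an object centered here?"
class_probs = y_hat[row, col, anchor, 5:]    # scores for the 80 classes
```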
*The only time this assertion is broken is when there are more objects within one grid cell area of the input image than there are anchor boxes - a flock of birds, pebbles on a rocky beach, etc. If there are a large number of objects close together, YOLO might not localize all of them. But this is not related to a mismatch between the number of anchor boxes and the number of known object classes.
Notice that at runtime it is not the case that only the taller-than-wide anchor box makes a prediction detecting a pedestrian. The wider-than-tall anchor box can also make one simultaneously. It is just that the taller-than-wide anchor box location has been trained to detect the pedestrians in the training images and will be more effective at doing so.
I encourage you to spend a little time with the linked threads, and others that explore the same or similar questions. The one about objects bigger than one grid cell, for example, has also been discussed at length elsewhere. If you still have questions, come back here and the community can help drill down further.
Some related threads…