Shouldn’t the number of anchor boxes be the same as the number of classes to be detected? In the video on the YOLO algorithm, the classes are pedestrian, car, and motorcycle, but the number of anchors is 2.

Anchor boxes, referred to by the YOLO creators in their papers as *dimension clusters* or *priors*, have a shape (width and height) but no class. They can be derived from the training set by various techniques, such as K-means. Here is a forum thread about how that is accomplished ==> [Deriving YOLO anchor boxes]
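
To make the K-means idea concrete, here is a minimal sketch in Python with NumPy. It is not the code from the linked thread. It assumes the distance measure described in the YOLOv2 paper (1 − IoU between box shapes, with the boxes aligned at a common corner), and the function names `iou_wh` and `kmeans_anchors` plus the toy data are made up for illustration. Notice that class labels never appear: only widths and heights are clustered.

```python
import numpy as np

def iou_wh(boxes, anchors):
    """IoU between (N, 2) box shapes and (K, 2) anchor shapes.

    Only width and height are compared; every box is treated as if it
    shared the same top-left corner, so position plays no role.
    """
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0])
             * np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    area_boxes = boxes[:, 0] * boxes[:, 1]
    area_anchors = anchors[:, 0] * anchors[:, 1]
    return inter / (area_boxes[:, None] + area_anchors[None, :] - inter)

def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster (width, height) pairs into k anchor shapes (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct ground-truth shapes picked at random.
    anchors = boxes[rng.choice(len(boxes), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Highest IoU is the same as smallest (1 - IoU) distance.
        assign = np.argmax(iou_wh(boxes, anchors), axis=1)
        updated = anchors.copy()
        for j in range(k):
            members = boxes[assign == j]
            if len(members):
                updated[j] = members.mean(axis=0)
        if np.allclose(updated, anchors):
            break
        anchors = updated
    return anchors

# Toy "training set": tall, narrow shapes and wide, short shapes.
data_rng = np.random.default_rng(1)
tall = data_rng.normal([1.0, 3.0], 0.1, size=(50, 2))
wide = data_rng.normal([4.0, 1.5], 0.1, size=(50, 2))
boxes = np.vstack([tall, wide])

anchors = kmeans_anchors(boxes, k=2)
print(anchors)  # two (width, height) pairs, one per cluster of shapes
```

With this toy data the two anchors should land near the tall and wide shapes, regardless of how many object classes produced those boxes.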

Consider the implications of your suggestion. YOLO outputs 4 bounding box values, 1 object presence prediction, and a vector of class predictions for each of SxSxB detector locations. For the dimensions and data used in this class exercise, that amounts to 19x19x5x(4+1+80) = 153,425 predictions (grid size is 19, anchor box number is 5, 80 classes). If there were 80 anchor boxes, one per class, the number of predictions would be 16x higher: 2,454,800. For ImageNet, with 1,000 classes and therefore 1,000 anchor boxes, the number balloons to 19x19x1000x(4+1+1000) = 362,805,000. The memory and computation that would require are just not feasible.
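
If you want to check the arithmetic yourself, here is a small Python sketch. The helper name `yolo_output_count` is made up for this post; it just multiplies out SxSxBx(4+1+C) for the three cases above.

```python
def yolo_output_count(grid, anchors, classes):
    """Number of values predicted: S x S x B x (4 box + 1 presence + C classes)."""
    return grid * grid * anchors * (4 + 1 + classes)

exercise = yolo_output_count(19, 5, 80)          # 5 anchors, 80 classes
one_per_class = yolo_output_count(19, 80, 80)    # one anchor per class
imagenet = yolo_output_count(19, 1000, 1000)     # 1,000 classes and anchors

print(exercise)                    # 153425
print(one_per_class)               # 2454800
print(one_per_class // exercise)   # 16
print(imagenet)                    # 362805000
```

The count grows linearly in the number of anchors, and tying anchors to classes makes it grow roughly with the square of the class count.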

Here is another related thread that discusses the role anchor boxes play in the output shape and the values of the bounding box shape predictions ==> [Applying YOLO anchor boxes]

Maybe take a look at those links and let us know what you think?