I understand the anchor boxes are chosen from a pre-defined set of anchor boxes that accommodates the various shape ratios (car, pedestrian,…). Why can’t the height and width of the anchor box be learnt, as they are with bounding boxes in classification with localization, for instance?

They are. Not exactly as bounding boxes, though. Bounding boxes are learned through supervised learning. Anchor boxes are learned through unsupervised learning. See if this thread helps…

Thank you for your reply. From your explanation and the YOLOv2 paper I understand how a set of anchor boxes can be derived using IOU as the metric in the K-means algorithm. My poorly worded question was more about the need to predefine such a set of boxes, i.e. the need to label the examples with boxes coming from that set of priors as opposed to using the original, more numerous boxes. I think the answer lies in the YOLOv2 paper under the paragraph “Convolutional With Anchor Boxes”, which appears to be an addition compared to the first YOLO paper.

That plus the following section **Dimension Clusters**.
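The **Dimension Clusters** idea can be sketched in a few lines: run K-means over the (width, height) pairs of the training boxes, but with distance d = 1 − IOU instead of Euclidean distance, so that large and small boxes are treated fairly. The helper names below are my own; this is a reconstruction from the paper’s description, not YOLO’s actual code:

```python
import numpy as np

def iou_wh(box, clusters):
    """IOU between one (w, h) box and k cluster (w, h) priors,
    treating all boxes as if they shared the same top-left corner
    (only shape matters, not location)."""
    inter = np.minimum(box[0], clusters[:, 0]) * np.minimum(box[1], clusters[:, 1])
    union = box[0] * box[1] + clusters[:, 0] * clusters[:, 1] - inter
    return inter / union

def kmeans_iou(boxes, k, seed=0, iters=100):
    """Cluster (w, h) pairs with distance d = 1 - IOU, as in the
    YOLOv2 'Dimension Clusters' section. boxes: array of shape (n, 2)."""
    rng = np.random.default_rng(seed)
    clusters = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # assign each box to the prior it overlaps most (largest IOU)
        assign = np.array([np.argmax(iou_wh(b, clusters)) for b in boxes])
        # recompute each prior as the mean shape of its assigned boxes
        new = np.array([boxes[assign == j].mean(axis=0) if np.any(assign == j)
                        else clusters[j] for j in range(k)])
        if np.allclose(new, clusters):
            break
        clusters = new
    return clusters
```

The output is the set of k anchor-box shapes (priors); the training labels themselves are untouched.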

I’m still not sure we’re on the same wavelength here. The bounding boxes used for training are the ‘original and more numerous boxes.’ The example images and their labels aren’t changed at all by the analysis of the anchor boxes or ‘priors’ in YOLO. The number of anchor boxes determines (in part) the network output shape. The anchor box shapes influence the bounding box shape predictions (through the equations provided in the **Direct location prediction** subsection of the **Dimension Clusters** section of the YOLO9000 paper). But the anchor box shapes aren’t part of the labels, nor do they replace them; they can’t, since anchor boxes have only a shape but no location.

Thank you for your response - I missed the point in the previous reply indeed; I was on the wrong path. Under the **Direct location prediction** subsection I now understand that the widths (*bw*) and heights (*bh*) of the bounding boxes are learnt through the scaling of the anchor boxes’ widths (*pw*) and heights (*ph*), respectively, using the positive factors (the exponentials) highlighted in that subsection. So, as far as widths and heights are concerned, what the algorithm learns are those positive scaling factors, through the learning of *tw* and *th*. Such training is indeed carried out using the ‘original and more numerous boxes’. I think I now see the role of the prior anchor boxes and how they are used.
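Spelling out the decoding step described above may help. The network predicts raw values (*tx*, *ty*, *tw*, *th*) per anchor per grid cell; the centre offsets are squashed with a sigmoid and added to the cell coordinates (*cx*, *cy*), while the sizes scale the prior dimensions (*pw*, *ph*) by a positive exponential. A minimal sketch (the function name is mine, not from the paper):

```python
import math

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode one prediction per the YOLOv2 'Direct location prediction'
    equations: b_x = c_x + sigma(t_x), b_y = c_y + sigma(t_y),
    b_w = p_w * e^{t_w}, b_h = p_h * e^{t_h}."""
    sigmoid = lambda t: 1.0 / (1.0 + math.exp(-t))
    bx = cx + sigmoid(tx)   # centre x, in grid-cell units, bounded to the cell
    by = cy + sigmoid(ty)   # centre y, likewise
    bw = pw * math.exp(tw)  # width  = prior width  scaled by a positive factor
    bh = ph * math.exp(th)  # height = prior height scaled by a positive factor
    return bx, by, bw, bh
```

With all raw predictions at zero, the decoded box sits at the centre of its cell with exactly the prior’s shape, which is why good priors make learning easier: *tw* and *th* only have to learn small corrections.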

There are a lot of ideas packed into those seemingly simple expressions for *bw* and *bh*, but it seems like you’ve got it.