Where are Anchor Boxes used?

Here’s an explanation that assumes YOLO v2.

Think of the grid cells and anchor boxes as a set of separate detectors, each of which requires training in order to make predictions. For v2, that is S*S*B detectors. For example, 19*19*2 = 722 separate prediction vectors are produced per image per forward pass of the network.
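
As a quick check of that count, here is a minimal sketch. The variable names are mine, and the per-detector vector length of 5 + number of classes (with 80 classes) is an assumption for illustration only.

```python
S = 19            # grid is S x S cells
B = 2             # anchor boxes per cell
num_classes = 80  # assumed for illustration, not from the text above

# One detector per (row, col, anchor) combination.
num_detectors = S * S * B

# Assumed layout per detector: objectness, 4 box values, class scores.
vector_len = 5 + num_classes
output_shape = (S, S, B, vector_len)

print(num_detectors)  # 722
print(output_shape)   # (19, 19, 2, 85)
```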

Training
During training, most of the ground truth vectors are all zero: there is nothing in the image that those detectors are ‘responsible’ for predicting. Where there is an object, the correct grid cell indices are determined from the ground truth bounding box center position. That puts you into the right cell. The correct anchor box index is then determined by computing the IOU of the ground truth bounding box shape with each of the anchor box shapes and taking the anchor with the highest value.
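
The two lookups can be sketched as below. This is an illustration under stated assumptions, not the reference implementation: box values are assumed normalized to [0, 1] relative to the image, the shape-only IOU treats both boxes as if they shared a center, and the anchor shapes and function names are hypothetical.

```python
def shape_iou(wh_a, wh_b):
    """IOU of two boxes compared by shape only, as if they shared a center."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union


def assign_detector(cx, cy, w, h, anchors, S=19):
    """Return (row, col, anchor_index) for one ground truth box.

    cx, cy, w, h are assumed normalized to [0, 1] relative to the image.
    """
    # The center position picks the grid cell (zero-based indices).
    col = min(int(cx * S), S - 1)
    row = min(int(cy * S), S - 1)
    # The shape picks the anchor: highest shape IOU wins.
    ious = [shape_iou((w, h), a) for a in anchors]
    best = max(range(len(anchors)), key=lambda i: ious[i])
    return row, col, best


# Hypothetical anchors: index 0 wide and low, index 1 tall and skinny.
anchors = [(0.40, 0.15), (0.15, 0.40)]

# A tall, narrow box somewhere in the image.
print(assign_detector(0.30, 0.62, 0.12, 0.35, anchors))  # (11, 5, 1)
```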

Suppose you have a 19x19 grid; two anchor boxes, index 0 wide and low, index 1 tall and skinny; and a person right in the center of a training image. With zero-based indices, the grid cell index would be (9,9), the center cell of a 19x19 grid. The anchor box index would be 1, since the IOU of the person's ground truth bounding box can be assumed to be higher with the tall and skinny anchor box than with the wide and low one. This example would therefore have a 19x19x2 grid of vectors that is zero everywhere except at (9,9,1). At that location the ground truth vector would be filled out with a 1 for object present, the correct center coordinates, the correct object shape values, and the correct class index (note that the class index need not be 1, since there will likely be far fewer anchor boxes than class types and there is no reason to assume the indices sync).

Training now proceeds as expected, with backprop driving predicted values towards the ground truth values, meaning towards zero everywhere but at the (9,9,1) detector.

Note that if a car and a person are both in the center of the image, then both the (9,9,0) and the (9,9,1) locations would have non-zero ground truth vectors, assuming the car's bounding box best matches the wide and low anchor.
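
To make the two-objects-in-one-cell case concrete, here is a minimal sketch that builds the ground truth grid with plain Python lists. Everything specific in it is hypothetical: the vector layout [object present, cx, cy, w, h, class index], the class indices, the box values, and the cell (4, 7), which is picked arbitrarily because any shared cell behaves the same way.

```python
S, B = 19, 2
VEC_LEN = 6  # assumed layout: [object present, cx, cy, w, h, class index]


def empty_target(S=S, B=B, vec_len=VEC_LEN):
    """An S x S x B grid of all-zero ground truth vectors."""
    return [[[[0.0] * vec_len for _ in range(B)] for _ in range(S)]
            for _ in range(S)]


def fill(target, row, col, anchor, box, class_index):
    """Write one object's ground truth into its responsible detector."""
    cx, cy, w, h = box
    target[row][col][anchor] = [1.0, cx, cy, w, h, float(class_index)]


target = empty_target()
row, col = 4, 7  # hypothetical cell containing both object centers

# Car matches the wide anchor (0), person the tall anchor (1).
fill(target, row, col, 0, (0.39, 0.24, 0.30, 0.12), class_index=2)
fill(target, row, col, 1, (0.39, 0.24, 0.10, 0.30), class_index=0)

# Every detector whose "object present" flag is set.
nonzero = [(r, c, b)
           for r in range(S) for c in range(S) for b in range(B)
           if target[r][c][b][0] == 1.0]
print(nonzero)  # [(4, 7, 0), (4, 7, 1)]
```

All other 720 vectors stay zero, which is what backprop pushes the corresponding detectors towards.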

Prediction
As implied above, during forward prop every detector produces an output vector based on its training. For the car-and-person image just discussed, if training has gone well, all but two detectors will predict either no object or an object with confidence low enough to be thresholded out. You end up with two object detections, one for the car and one for the person. Note that we assume these two boxes have low IOU with each other, even though their centers are colocated, because their shapes are distinct. Thus they will both survive non-max suppression.
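
The filtering step can be sketched as follows. This is a simplified, class-agnostic version written for illustration; the thresholds of 0.5, the detection values, and the function names are all assumptions, and real pipelines often run non-max suppression per class.

```python
def iou(a, b):
    """IOU of two boxes given as (cx, cy, w, h)."""
    ax1, ay1, ax2, ay2 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1, bx2, by2 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def filter_detections(dets, conf_thresh=0.5, iou_thresh=0.5):
    """Confidence threshold, then greedy non-max suppression.

    dets is a list of (confidence, box); both thresholds are assumed values.
    """
    candidates = sorted((d for d in dets if d[0] >= conf_thresh),
                        key=lambda d: d[0], reverse=True)
    kept = []
    for conf, box in candidates:
        # Keep a box only if it does not overlap heavily with a better one.
        if all(iou(box, k[1]) < iou_thresh for k in kept):
            kept.append((conf, box))
    return kept


dets = [
    (0.90, (0.50, 0.50, 0.40, 0.15)),  # car: wide and low
    (0.85, (0.50, 0.50, 0.15, 0.40)),  # person: tall and skinny, same center
    (0.60, (0.51, 0.50, 0.38, 0.16)),  # near-duplicate of the car box
    (0.05, (0.20, 0.20, 0.10, 0.10)),  # low confidence, thresholded out
]

kept = filter_detections(dets)
print(len(kept))  # 2
```

The car and person boxes share a center but overlap only in a small central region, so their IOU stays below the threshold and both are kept, while the near-duplicate car box is suppressed.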

Anchor box shape and predicted bounding box shape
Anchor box shapes also play a significant role in generating the shape predictions. The values the YOLO CNN is trained to produce are not the object shape itself. Rather, they are log-space scaling factors: each one is exponentiated and then multiplied with the corresponding anchor box dimension to produce a shape prediction. The relationship between the network outputs, let's call them t_w and t_h, and the predicted bounding box shape, b_w and b_h, is:

b_w = p_w * e^{(t_w)}
b_h = p_h * e^{(t_h)}

where p_w and p_h are the anchor box, or prior, shape. If the ground truth shape is exactly the same as an anchor box shape, the network should predict 0 for both t_w and t_h, since e^0 = 1.
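
These two expressions, and their inverse for building training targets, can be sketched in a few lines. The function names and the anchor shape are hypothetical, and I am assuming shapes are measured in grid cell units.

```python
import math


def decode_shape(t_w, t_h, p_w, p_h):
    """Network outputs -> predicted box shape: b = p * e^t."""
    return p_w * math.exp(t_w), p_h * math.exp(t_h)


def encode_shape(b_w, b_h, p_w, p_h):
    """Inverse mapping, box shape -> target outputs: t = ln(b / p)."""
    return math.log(b_w / p_w), math.log(b_h / p_h)


p_w, p_h = 1.5, 4.0  # hypothetical anchor shape, assumed in grid cell units

# Outputs of zero reproduce the anchor shape exactly.
print(decode_shape(0.0, 0.0, p_w, p_h))  # (1.5, 4.0)

# A ground truth shape twice the anchor size maps to t = ln(2) in each dimension.
print(encode_shape(3.0, 8.0, p_w, p_h))
```

Because the exponential is always positive, the predicted width and height can never be negative, whatever the network outputs.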

Notice that these expressions allow bounding box predictions much larger than the grid cell size, regardless of the anchor box shapes, because e^{t} has no upper bound. This answers a related question many people ask.

Hope this helps