### Applying YOLO Anchor Boxes

A previous thread provides an introduction to extracting a small set of shapes (on the order of 3 <= k <= 10) from labelled ground truth data, referred to in the YOLO 9000 paper as *dimension clusters* or *priors*, but in this class as *anchor boxes*. NOTE: Just to make things more confusing, the YOLO 9000 paper uses *anchor boxes* to refer instead to the hand-picked shapes described in the Faster R-CNN paper, not the ones generated by K-means. The YOLO 9000 paper asserts that the set of shapes generated by a K-means algorithm is qualitatively better than other approaches. But why is that? Below is an introduction to how the K-means dimension clusters (from here on I'll just call them anchor boxes) are used in YOLO object location predictions.

#### The motivation

Think back to what you have learned already about the steps of neural network training. You initialize weights, often randomly, run a forward propagation, then evaluate the difference between the output of the network and the ground truth data, likely using a Euclidean distance. After the first pass, the predicted values are not likely to be anywhere close to the ground truth values. Best case, this requires extra iterations and epochs of training to overcome. Worst case, you end up in a local optimum and never get good results. The YOLO team experienced this in v1, where they initialized randomly and output bounding box coordinate predictions directly. To overcome this, YOLO 9000 predicts the bounding box center location directly, constrained by the sigmoid activation function, but predicts the bounding box size as an offset from one of the K-means generated anchor box shapes. The latter is subtle and merits further exposition.

#### The implementation

The following figure depicts the bounding box location-related values that the YOLO CNN outputs:

*Bounding Box Center*

In the diagram above, t_x and t_y relate to the predicted center of the bounding box. They are passed through the sigmoid activation function (and thus constrained to lie between 0 and 1) and then added to the image-relative offset of the containing grid cell (c_x, c_y) to generate the predicted center location (b_x, b_y). c_x and c_y are integers between 0 and S - 1, where S is the number of grid cells along each axis. \sigma(t) is a floating-point value between 0 and 1. Therefore b_x and b_y, each the sum of \sigma(t) and a grid cell index, are floating-point values between 0 and S. For a concrete example, an object with its center at the exact center of an image with 19 x 19 grid cells lies at offset 0.5 within grid cell (9, 9), so (b_x, b_y) = (9.5, 9.5). Notice from the equations provided in the diagram that you can also express t_x as logit(b_x - c_x) and t_y as logit(b_y - c_y). This form is useful in defining ground truth values.
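
The center math can be sketched in a few lines of plain Python (the helper names here are mine, not from keras_yolo.py):

```python
import math

def decode_center(t_x, t_y, c_x, c_y):
    """Map raw network outputs (t_x, t_y), plus the containing grid
    cell offset (c_x, c_y), to the predicted center (b_x, b_y) in
    grid-cell units: b = sigmoid(t) + c."""
    sigmoid = lambda t: 1.0 / (1.0 + math.exp(-t))
    return sigmoid(t_x) + c_x, sigmoid(t_y) + c_y

def logit(p):
    """Inverse of sigmoid; used to turn a ground-truth offset back
    into a target value: t_x = logit(b_x - c_x)."""
    return math.log(p / (1.0 - p))

# An object centered in the middle of a 19x19 grid sits in cell (9, 9)
# at offset 0.5, and sigmoid(0) = 0.5, so t = 0 lands it at (9.5, 9.5).
b_x, b_y = decode_center(0.0, 0.0, 9, 9)
```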

*Bounding Box Shape*

Similarly, t_w and t_h relate to the predicted size (shape) of the bounding box. These values are output directly by the network, and they are defined relative to an anchor box shape. Notice in the diagram how the values are used to generate the bounding box size (b_w, b_h). First, t_w and t_h are used as exponents of the natural exponential function. The results e^{t_w} and e^{t_h} are multiplied by the anchor box shape (p_w, p_h). Notice that if t_w or t_h equals 0, the exponential term reduces to 1 (e^0 = 1), and b_w and b_h will equal p_w and p_h, respectively. If t_w and t_h are greater than 0, the exponential terms are greater than 1, and b_w and b_h will be greater than p_w and p_h; this works even when the result is larger than a grid cell, and it is exactly the mechanism that allows YOLO to predict bounding box shapes larger than one grid cell. If t_w and t_h are less than 0, the exponential terms are less than 1 (though still positive), and b_w and b_h will be greater than 0 but less than p_w and p_h.
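
Here is the shape math as a small sketch (again with hypothetical helper names, not code from the repository):

```python
import math

def decode_shape(t_w, t_h, p_w, p_h):
    """Scale the anchor box shape (p_w, p_h) by e^t to produce the
    predicted bounding box shape (b_w, b_h)."""
    return p_w * math.exp(t_w), p_h * math.exp(t_h)

# t = 0 reproduces the anchor shape exactly, since e^0 = 1.
same = decode_shape(0.0, 0.0, 4.0, 2.0)

# t > 0 grows the box beyond the anchor; t < 0 shrinks it, but the
# exponential keeps the result strictly positive.
bigger = decode_shape(1.0, 1.0, 4.0, 2.0)
smaller = decode_shape(-1.0, -1.0, 4.0, 2.0)
```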

You can see these steps occurring in the following lines of code from the yolo_head() function of keras_yolo.py.

These two lines perform the initial operation on the features output by the network:

```python
box_xy = K.sigmoid(feats[..., :2])
box_wh = K.exp(feats[..., 2:4])
```

These two lines finish the algebra: the first adds the grid cell offset, and the second multiplies by the anchor box dimensions. Both then divide by the grid dimensions (conv_dims), which scales the values to image-relative units.

```python
box_xy = (box_xy + conv_index) / conv_dims
box_wh = box_wh * anchors_tensor / conv_dims
```

After this, all four predicted bounding box values are ready to use in the loss function.
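
To see all four values decoded together, here is a NumPy sketch of the same math. This is my reimplementation for illustration, not the Keras code itself, and the tensor shapes are assumptions: feats is (S, S, B, 4) holding [t_x, t_y, t_w, t_h], and anchors is (B, 2) in grid-cell units.

```python
import numpy as np

def yolo_decode(feats, anchors, S):
    """NumPy sketch of the two decode steps in yolo_head().
    Assumed shapes: feats (S, S, B, 4) with [t_x, t_y, t_w, t_h] last;
    anchors (B, 2) in grid-cell units."""
    # conv_index: the (c_x, c_y) offset of every grid cell
    cy, cx = np.meshgrid(np.arange(S), np.arange(S), indexing="ij")
    conv_index = np.stack([cx, cy], axis=-1)[:, :, None, :].astype(float)

    box_xy = 1.0 / (1.0 + np.exp(-feats[..., 0:2]))  # sigmoid(t_x), sigmoid(t_y)
    box_wh = np.exp(feats[..., 2:4])                 # e^{t_w}, e^{t_h}

    box_xy = (box_xy + conv_index) / S               # image-relative center
    box_wh = box_wh * anchors / S                    # image-relative shape
    return box_xy, box_wh

# All-zero activations: every center sits at the middle of its own grid
# cell, and every predicted shape equals its anchor (scaled to image units).
S, B = 19, 5
xy, wh = yolo_decode(np.zeros((S, S, B, 4)), np.full((B, 2), 3.0), S)
```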

#### Takeaways

This is an important characteristic of YOLO that bears close inspection. The YOLO CNN isn't predicting bounding box shape directly. Instead, it is predicting a floating-point number that is used as the exponent of the natural exponential function and then multiplied by an anchor box shape. This product is what gets used within the loss function. Poorly chosen anchor box shapes result directly in more error for the optimizer to deal with. In the previous thread I showed that some of the BDD ground truth bounding box shapes were hundreds of pixels, while the anchor boxes selected by the naive approach were never more than 16. That 10x miss is extra work for gradient descent to overcome.

Also of note is that the ground truth bounding boxes are manipulated on the way in with the inverse of the exponential, namely the *log()* function, which is applied in the following lines of code from the *preprocess_true_boxes()* function in keras_yolo.py:

```python
np.log(box[2] / anchors[best_anchor][0]),
np.log(box[3] / anchors[best_anchor][1])
```

Ground truth is stored after taking the log() of the box width (box[2]) and height (box[3]) divided by the width (anchors[best_anchor][0]) and height (anchors[best_anchor][1]) of the best anchor, which was determined using IOU between the anchors and the ground truth (true) box. If the ground truth width and best anchor width are equal, the ratio is 1, and log(1) = 0. Later, e^0 = 1 becomes the multiplicative factor, meaning the predicted shape will be the same as the anchor box.
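
A quick round trip with hypothetical numbers shows why the log() is the right preprocessing step: encoding with log() and decoding with exp() are exact inverses, so a perfect prediction recovers the ground truth width exactly.

```python
import math

# Hypothetical numbers: ground truth box 120 px wide, best anchor
# 100 px wide, both in the same units.
p_w, true_w = 100.0, 120.0

# Encoding, as in np.log(box[2] / anchors[best_anchor][0])
t_w = math.log(true_w / p_w)

# Decoding, as in yolo_head(): multiply the anchor shape by e^{t_w}
recovered_w = p_w * math.exp(t_w)

# Perfect anchor match: ratio 1, so the stored target is log(1) = 0
perfect_target = math.log(100.0 / 100.0)
```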

#### Final Thought

Anchor box shapes determine which cell in the multi-dimensional ground truth tensor an image object is 'assigned' to. They are used to preprocess and scale the ground truth data so that it can be compared against the network output after the exponential function has been applied. And the anchor shape is used as a multiplicative factor applied before the predicted shape is passed to the loss function. Clearly, anchor boxes are woven deeply into the fabric of YOLO. They are not merely there to support detection of multiple objects per image, though they do directly impact that capability. Therefore, if you are running a YOLO implementation from darknet or a third party, it is important that you also define reasonable anchor box shapes, because the code assumes you have done so. While you could pick some randomly, or use a naive approach like sorting common shapes by occurrence, you should not expect those to yield the same quality of results.

The following image is from the BDD val set. The blue boxes are the ground truth labels for the car objects. The red boxes are the best anchors, picked from the 8 dimension clusters I generated using K-means in the previous thread.

For comparison, here is the same image and blue ground truth boxes, but here the red anchor boxes were chosen by sorting on common shapes.

Finally, here is how the labelled ground truth gets mapped into the YOLO output shape with grid cells and anchor boxes before training, using just one of the objects for discussion.

In this 608x608 cropped version of the same image, the yellow boxes represent the 32x32 pixel grid cells when S = 19. The blue box represents the ground truth for one of the cars (the same as in the two images above). The small red dot represents the center of the labelled object, and the small red box represents the grid cell in which that center lies. The large red box represents the best anchor box for that ground truth shape, based on IOU. In the S x S x B training target, all the cells will be 0 *for this object* except the one where the grid cell and anchor box are red in the picture, that is, grid cell (5, 8) and anchor box 6. For that entry, the stored x and y values contain the offset of the red dot object center from the origin of the red grid cell, while the stored width and height values contain the log of the ratio of the ground truth width and height to the best anchor width and height.
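
The best-anchor selection mentioned above can be sketched as a shape-only IOU: imagine the ground truth box and each anchor centered on the same point, so the intersection is simply min(w) * min(h). This is the idea behind the selection in preprocess_true_boxes(), not the exact YAD2K code, and the numbers below are hypothetical.

```python
def best_anchor_by_iou(box_wh, anchors):
    """Pick the index of the anchor whose shape best matches the
    ground truth shape, using center-aligned (shape-only) IOU."""
    bw, bh = box_wh
    best_i, best_iou = 0, 0.0
    for i, (aw, ah) in enumerate(anchors):
        inter = min(bw, aw) * min(bh, ah)       # overlap when centers coincide
        union = bw * bh + aw * ah - inter
        iou = inter / union
        if iou > best_iou:
            best_i, best_iou = i, iou
    return best_i

# A wide, car-like 5.5 x 4.2 shape (grid-cell units) matches the
# wide 6 x 4 anchor rather than the small or tall ones.
best = best_anchor_by_iou((5.5, 4.2), [(1.0, 1.0), (2.0, 3.0), (6.0, 4.0)])
```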

Hope this helps.