Applying YOLO anchor boxes

A previous thread provides an introduction to extracting from labelled ground truth data a small set, on the order of 3 <= k <= 10, of shapes for what the YOLO 9000 paper refers to as dimension clusters or priors, but this class calls anchor boxes. NOTE: Just to make things more confusing, the YOLO 9000 paper uses anchor boxes to refer instead to the hand-picked shapes described in the Faster R-CNN paper, not the ones generated by K-means. The YOLO 9000 paper asserts that the set of shapes generated by a K-means algorithm is qualitatively better than other approaches. But why is that? Below is an introduction to how the K-means dimension clusters (from here on I’ll just call them anchor boxes) are used in YOLO object location predictions.

The motivation

Think back to what you have learned already about the steps of neural network training. You initialize weights, often randomly, run a forward propagation, then evaluate the difference between the output of the network and the ground truth data, likely using a Euclidean distance. When this occurs after the first pass, the predicted values are not likely to be anywhere close to the ground truth values. Best case, this requires extra iterations and epochs of training to overcome. Worst case, you end up in a local optimum and never get good results. The YOLO team experienced this when initializing randomly and directly outputting bounding box coordinate predictions in v1. To overcome this, YOLO 9000 predicts bounding box center location directly using the sigmoid activation function, but bounding box size using an offset from one of the K-means generated anchor box shapes. The latter is subtle and merits further exposition.

The implementation

The following figure depicts the bounding box location-related values that the YOLO CNN is outputting:

Bounding Box Center

In the diagram above, t_x and t_y relate to the predicted center of the bounding box. They are operated upon by the sigmoid activation function (and thus constrained to be between 0 and 1) and then added to the image-relative offset of the containing grid cell (c_x, c_y) to generate the predicted center location b_x, b_y. c_x and c_y are integers between 0 and the number of grid cells minus 1, and \sigma(t) is a floating point value between 0 and 1. Therefore b_x and b_y, each the sum of \sigma(t) and a grid cell index, are floating point values between 0 and the number of grid cells. For a concrete example, an object with its center in the exact center of an image with 19 x 19 grid cells would have (b_x, b_y) = (9.5, 9.5). Notice from the equations provided in the diagram you can also express t_x as logit(b_x - c_x) and t_y as logit(b_y - c_y). This form is useful in defining ground truth values.
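The center decode, and its inverse used when constructing ground truth, can be sketched in a few lines of Python. The t and c values below are made up for illustration; the cell index 9 is the middle cell of a 19 x 19 grid:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logit(p):
    # Inverse of the sigmoid; valid for 0 < p < 1
    return np.log(p / (1.0 - p))

# Hypothetical raw network outputs for the center of one box
t_x, t_y = 0.0, 2.0
# Grid cell containing the object center (0-indexed)
c_x, c_y = 9, 9

# Decode: b = sigma(t) + c, so the center always stays inside cell (c_x, c_y)
b_x = sigmoid(t_x) + c_x   # 9.5
b_y = sigmoid(t_y) + c_y   # ~9.88

# Round trip: recover the raw value from a known center (useful for ground truth)
assert np.isclose(logit(b_x - c_x), t_x)
```

Because the sigmoid output never leaves (0, 1), the predicted center can never escape the grid cell responsible for the object, no matter what raw value the network emits.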

Bounding Box Shape

Similarly, t_w and t_h relate to the predicted size (shape) of the bounding box. These values are output directly by the network, and they are defined relative to an anchor box shape. Notice in the diagram how the values are used to generate the bounding box size b_w, b_h. First, t_w and t_h are used as exponents of the natural exponential function. The results e^{t_w} and e^{t_h} are multiplied by the anchor box shape p_w, p_h. Notice that if t_w or t_h equals 0, the exponential term reduces to 1, and b_w and b_h will be equal to p_w and p_h, respectively (e^0 = 1). If t_w and t_h are greater than 0, the exponential terms are greater than 1, and b_w and b_h will be greater than p_w and p_h (this works even if the result is larger than the grid cell dimension – it is exactly the mechanism that allows YOLO to predict bounding box shapes larger than one grid cell). If t_w and t_h are less than 0, the exponential terms are less than 1 (though still positive), and b_w and b_h will be greater than 0 but less than p_w and p_h.
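A quick sketch of the shape decode, using a made-up anchor width in grid-cell units, shows all three cases at once:

```python
import numpy as np

# Hypothetical anchor box width (in grid-cell units)
p_w = 3.2

# t < 0 shrinks the anchor, t = 0 reproduces it exactly, t > 0 grows it;
# b_w stays strictly positive for any real t_w, since e^t > 0
widths = {t_w: p_w * np.exp(t_w) for t_w in (-1.0, 0.0, 1.0)}
print(widths)
```

Note that a good anchor shape means the network only has to learn small t values near 0 rather than large corrections.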

You can see these steps occurring in the following lines of code from the yolo_head() function.

These two lines perform the initial operation on the features output by the network:

    box_xy = K.sigmoid(feats[..., :2])
    box_wh = K.exp(feats[..., 2:4])

These two lines finish the algebra. The first adds the grid cell offset; the second multiplies by the anchor box dimensions.

    box_xy = (box_xy + conv_index) / conv_dims
    box_wh = box_wh * anchors_tensor / conv_dims

After this, all four predicted bounding box values are ready to use in the loss function.
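The same decode steps can be sketched in plain numpy for a single box. The raw outputs, grid cell, and anchor shape below are all hypothetical, and the division by S mirrors the division by conv_dims in the code above:

```python
import numpy as np

S = 19  # grid dimension, as in the exercise

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical raw outputs (t_x, t_y, t_w, t_h) for one box in grid cell (5, 8),
# with a made-up anchor shape in grid-cell units
t = np.array([0.2, -0.4, 0.5, 0.1])
cell = np.array([5.0, 8.0])      # c_x, c_y
anchor = np.array([6.0, 3.5])    # p_w, p_h

# Mirror of the two yolo_head() steps: activation, then offset/scale,
# with a final division by S as in the lines quoted above
box_xy = (sigmoid(t[:2]) + cell) / S
box_wh = np.exp(t[2:]) * anchor / S

print(box_xy, box_wh)
```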


This is an important characteristic of YOLO that bears close inspection. The YOLO CNN isn’t predicting bounding box shape directly. Instead, it is predicting a floating point number that is used as the exponent of the natural exponential function and then multiplied by an anchor box shape. That product is what gets used within the loss function. Poorly chosen anchor box shapes therefore result directly in more error for the optimizer to deal with. In the previous thread I showed that some of the BDD ground truth bounding box shapes were hundreds of pixels, while the anchor boxes selected by the naive approach were never more than 16. That is extra work for gradient descent in order to overcome a roughly 10x miss.
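You can quantify that extra work by computing the regression target log(gt_w / p_w) for a well-matched versus a badly undersized anchor; both widths below are made up for illustration:

```python
import numpy as np

gt_w = 300.0   # hypothetical ground truth width in pixels

# A well-matched anchor vs. the kind of undersized anchor the naive approach produced.
# The log ratio is the value the network must learn to output:
# well matched: ~0.07, undersized: ~2.93
for p_w in (280.0, 16.0):
    t_w = np.log(gt_w / p_w)
    print(p_w, round(float(t_w), 2))
```

A target near 0 is exactly what a randomly initialized network tends to emit, so a well-matched anchor starts the optimizer close to the answer.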

Also of note is that the ground truth bounding boxes are manipulated on the way in with the inverse of the exponential, namely the log() function, which is applied in the following lines of code from the preprocess_true_boxes() function:

                np.log(box[2] / anchors[best_anchor][0]),
                np.log(box[3] / anchors[best_anchor][1])

Ground truth is stored after taking the log() of the box width [2] and height [3] divided by the width [0] and height [1] of the best_anchor, which was determined using IOU between the anchors and the ground truth (true) box. If the ground truth width and the best anchor width are equal, the ratio is 1, and log(1) = 0. Later, e^0 = 1 becomes the multiplicative factor, meaning the predicted shape will be the same as the anchor box.
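The encode/decode pair is an exact round trip, which a small numpy sketch (with made-up box and anchor shapes) makes clear:

```python
import numpy as np

# Hypothetical ground truth box and best anchor, both as (w, h) in grid-cell units
box = np.array([4.0, 2.5])
best_anchor = np.array([6.0, 3.5])

# Encode, as in the preprocess step: log of the ratio to the best anchor
t_wh = np.log(box / best_anchor)

# Decode, as in yolo_head(): exp then multiply by the anchor recovers the box exactly
recovered = np.exp(t_wh) * best_anchor
assert np.allclose(recovered, box)

# An exact anchor match encodes to 0, since log(1) = 0
assert np.allclose(np.log(best_anchor / best_anchor), 0.0)
```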

Final Thought

Anchor box shapes determine which cell in the multi-dimensional ground truth an image object is ‘assigned’ to. They are used to preprocess and scale the ground truth data so that it can be compared after the exponential function has been applied to the network output. And the shape is used as a multiplicative factor applied before the predicted shape is passed to the loss function. Clearly, anchor boxes are woven deeply into the fabric of YOLO. They are not merely there to support detection of multiple objects per image, though they do directly impact that capability. Therefore, if you are running a YOLO implementation from darknet or a third party, it is important that you also define reasonable anchor box shapes, because the code assumes you have done so. While you could choose to pick some randomly, or use a naive approach like sorting common shapes by occurrence, you should not expect those to yield the same quality of results.

The following image is from the BDD val set. The blue boxes are the ground truth labels for the car objects. The red boxes are the best anchors, picked from the 8 dimension clusters I generated using K-means in the previous thread.

For comparison, here is the same image and blue ground truth boxes, but here the red anchor boxes were chosen by sorting on common shapes.

Finally, here is how the labelled ground truth gets mapped into the YOLO output shape with grid cells and anchor boxes before training, using just one of the objects for discussion.

In this 608x608 cropped version of the same image, the yellow boxes represent the 32x32 pixel grid cells when S=19. The blue box represents the ground truth for one of the cars (same as in the two images above). The small red dot represents the center of the labelled object, and the small red box represents the grid cell in which that center lies. The large red box represents the best anchor box for that shape of ground truth label, based on IOU. In the SxSxB training data input, all the cells will be 0 for this object except the one where the grid cell and anchor box are red in the picture – that is, grid cell (5,8) and anchor box (6). For that dimension, the x and y values contain the offset of the red dot object center from the origin of the red square grid cell, while the w and h values contain the log of the ratio of the ground truth width and height to the best anchor width and height.
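The assignment step can be sketched roughly as below. The ground truth values and anchor shapes are invented for illustration, but with these numbers the object lands in grid cell (5,8), matching the picture:

```python
def iou_wh(wh1, wh2):
    # IOU of two boxes compared shape-to-shape (centers aligned),
    # which is how the best anchor is chosen for a ground truth box
    inter = min(wh1[0], wh2[0]) * min(wh1[1], wh2[1])
    union = wh1[0] * wh1[1] + wh2[0] * wh2[1] - inter
    return inter / union

S = 19
cell_px = 608 / S                 # 32-pixel grid cells

# Hypothetical ground truth: center in pixels, shape in grid-cell units
center = (171.0, 266.0)
gt_wh = (5.5, 3.0)
anchors = [(1.0, 1.5), (2.5, 2.0), (6.0, 3.5)]   # made-up anchor shapes

grid_cell = (int(center[0] // cell_px), int(center[1] // cell_px))   # (5, 8)
best = max(range(len(anchors)), key=lambda i: iou_wh(gt_wh, anchors[i]))

print(grid_cell, best)
```

Only the (grid_cell, best) slot of the training tensor is populated for this object; every other cell/anchor combination stays 0.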

Hope this helps.


Thanks for contributing this.

Why divide by conv_dims?

At different places in the code of this YOLO implementation, grid cell-relative or image-relative coordinates and dimensions are used, and these lines are scaling in one direction or the other. I haven’t looked at this code in a long time and can’t remember the value of conv_dims, but since it is a division, I am guessing it is going down to grid cell-relative here. That would be consistent with the accompanying narrative that this is done in preparation for use in the loss function. The YOLO v2 loss function wants all error components (location, shape, presence, class prediction) to be of the same order of magnitude. Without scaling down, center location values might be in the hundreds of pixels, right? Whereas shape, class prediction, and object presence are order of magnitude 1.


I finally went back and looked at the code. In the utility file you can find a line

conv_dims = K.shape(feats)[1:3]

where feats is short for features, the output of the YOLO CNN network. The shape of the output of the network for YOLO v2 is (m, S, S, B, 5 + C) where m is as usual the number of training examples, S is the grid cell count, B is the anchor box count, and C is the number of classes. Remember for this exercise, C=80, B=5, and S=19.

So the code in question is dividing the predicted object bounding box center location x and y values and predicted bounding box size w and h values by 19, which scales them from grid cell-relative values down to image-relative values between 0 and 1. HTH. @Pixies

That matches my original intuition – that it scales to the relative position of the image and not to its convolution. I hadn’t looked at Figure 3 closely.
The output of the YOLO is (m, S, S, B*(5 + C))? I mean, is it 4D or 5D? I thought yolo_head preprocessed the output to match the labels processed by preprocess_true_boxes, in order to use them in the loss.
Thank you for taking the time to respond to me. I’ve had a hard time understanding this part of the algorithm.

To the best of my knowledge, the output of the network itself is 5D. Also, in my (admittedly old) version of the exercise I find this …

3.3 - Convert output of the model to usable bounding box tensors

The output of yolo_model is a (m, 19, 19, 5, 85) tensor that needs to pass through non-trivial processing and conversion.

I also find this code inside the function yolo_loss()

    yolo_output_shape = K.shape(yolo_output)
    feats = K.reshape(yolo_output, [
        -1, yolo_output_shape[1], yolo_output_shape[2], num_anchors,
        num_classes + 5
    ])

So the feats matrix mentioned in this thread above is definitely 5D.

Collapsing further to (m, 19, 19, 5*85) might be advantageous for some purposes (like visualization), but really it’s the same numbers just reshaped, right? Whatever works best for the matrix math that needs to be done at that point in the algorithm. There are several places in this implementation where arrays are stacked/unstacked/reshaped for convenience of the math. Cheers