Where are Anchor Boxes used?

I was hoping for some clarification on anchor boxes as I’m a bit confused.

I don’t understand how we get them into the algorithm. Presumably the truth data must contain (nominally unique) bounding boxes for the objects in the image, i.e. not just a best fit chosen from our set of anchor boxes.

So then how do we force the algorithm to pick an anchor box for a given object? For instance if we have a person and a car in the same grid cell, and maybe we have two different shaped anchor boxes, so that the vector y contains two sets of pc, bx, by, … etc, say for the car first and person second, how do we force the algorithm to pick the tall narrow anchor box and test it against the truth data for the person, and the wide one for the car?

First you need to clarify whether you are asking about training time or about operational use/prediction. The answer is not the same for both. The answer also differs between YOLO versions. Are you asking about v2?

Here’s an explanation that assumes v2

Think of the grid cells and anchor boxes as a set of separate detectors, each of which requires training in order to make predictions. For v2, that is S*S*B detectors, e.g. 19*19*2 = 722 separate prediction vectors produced per image per forward propagation of the network.
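A minimal sketch of what that output looks like, assuming a 19x19 grid, 2 anchor boxes, and an illustrative 80 classes (the exact class count is my assumption, not something from the thread):

    import numpy as np

    S, B, C = 19, 2, 80           # grid size, anchor boxes, classes (C is illustrative)
    # One prediction vector per detector: [p_c, b_x, b_y, b_w, b_h, c_1 ... c_C]
    y_hat = np.zeros((S, S, B, 5 + C))

    print(y_hat.shape)            # (19, 19, 2, 85)
    print(S * S * B)              # 722 separate detectors / prediction vectors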

Training
During training, most of the ground truth vectors are all zero - there is nothing in the image those detectors are ‘responsible’ for predicting. Where there is an object, the correct grid cell indices are determined from the ground truth bounding box center position. That puts you into the right cell. The correct anchor box index is determined using IOU of the ground truth bounding box shape with the anchor box shapes.
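A rough sketch of that assignment step, assuming normalized (center x, center y, width, height) ground truth boxes; shape_iou and assign_detector are hypothetical helpers, and the anchor shapes are made up:

    import numpy as np

    def shape_iou(wh_a, wh_b):
        # IOU of two boxes compared by shape only (centers aligned)
        inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
        union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
        return inter / union

    def assign_detector(gt_box, anchors, S=19):
        # gt_box = (x_center, y_center, w, h), all relative to image size
        x, y, w, h = gt_box
        col, row = int(x * S), int(y * S)      # grid cell from the box center
        best_anchor = int(np.argmax([shape_iou((w, h), a) for a in anchors]))
        return row, col, best_anchor

    anchors = [(0.30, 0.10), (0.08, 0.35)]     # wide/low, tall/skinny (made up)
    print(assign_detector((0.24, 0.67, 0.50, 0.18), anchors))   # car-ish box -> (12, 4, 0)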

Suppose you have a 19x19 grid; two anchor boxes, one wide and low, one tall and skinny; and a person right in the center of a training image. The grid cell index would be (9,9), the center cell of a 19x19 grid (0-indexed). The anchor box index would be 1, since the IOU of the person's ground truth bounding box can be assumed higher with the tall and skinny anchor box than with the wide and low one. This example therefore would have a 19x19x2 grid of vectors that is zero everywhere except at (9,9,1). At that location the ground truth vector would be filled out with a 1 for object present, the correct center offsets, the correct object shape values, and the correct class index (note the class index need not be 1, since there will likely be far fewer anchor boxes than class types and no reason to assume the indices sync).
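Continuing that example as a sketch, the ground truth tensor would be built roughly like this (the class index 14 for "person" and the exact shape values are made up for illustration):

    import numpy as np

    S, B, C = 19, 2, 80
    y = np.zeros((S, S, B, 5 + C))       # every detector starts 'empty'

    row, col, anchor = 9, 9, 1           # from the center position and the shape IOU
    x_off, y_off = 0.5, 0.5              # box center relative to its grid cell
    w, h = 0.07, 0.40                    # box shape relative to the image
    person_class = 14                    # illustrative class index

    y[row, col, anchor, 0] = 1.0                         # object present
    y[row, col, anchor, 1:5] = [x_off, y_off, w, h]      # center and shape
    y[row, col, anchor, 5 + person_class] = 1.0          # one-hot class

    print(np.count_nonzero(y))           # 6 non-zero values; everything else stays 0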

Training now proceeds as expected, with backprop driving the predicted values \hat{y} towards the ground truth values y, meaning towards zero everywhere but at the (9,9,1) detector.

Note that if the car and person are both in the center of the image, then both the (9,9,0) and the (9,9,1) locations would have non-zero y and \hat{y}.
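To make "driving towards ground truth" concrete, here is a deliberately simplified loss sketch. It is not the real YOLO v2 loss (which weights the coordinate, confidence, and class terms separately); it only shows how the object-present flag in y masks which detectors contribute coordinate and class error:

    import numpy as np

    def simplified_loss(y, y_hat):
        # Toy squared-error loss with an object mask; real YOLO uses weighted terms
        obj_mask = y[..., 0:1]                                # 1 where a detector owns an object
        coord_err = np.sum(obj_mask * (y[..., 1:5] - y_hat[..., 1:5]) ** 2)
        conf_err  = np.sum((y[..., 0] - y_hat[..., 0]) ** 2)  # all detectors, object or not
        class_err = np.sum(obj_mask * (y[..., 5:] - y_hat[..., 5:]) ** 2)
        return coord_err + conf_err + class_err

Note that even in this toy version the confidence term sees every detector, which is how the network learns to output "no object" almost everywhere.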

Prediction
As implied above, during forward prop every detector produces an output vector based on its training. For the example image discussed, if training has gone well, all but two detectors will predict either no object, or at most an object with a confidence low enough to be thresholded out. You end up with two object detections, one for the car and one for the person. Note that we can assume these two boxes have low IOU with each other, even though their centers are colocated, because their shapes are distinct. Thus they will both survive non-max suppression.
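A quick numerical check of that last claim, using a plain IOU helper and made-up car and person shapes sharing the same center:

    def iou(box_a, box_b):
        # IOU of two boxes given as (x_center, y_center, w, h)
        ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
        ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
        bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
        bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
        iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
        ih = max(0.0, min(ay2, by2) - max(ay1, by1))
        inter = iw * ih
        return inter / (box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter)

    car    = (0.5, 0.5, 0.60, 0.25)   # wide and low
    person = (0.5, 0.5, 0.10, 0.55)   # tall and skinny
    print(iou(car, person))           # ~0.14, well below a typical NMS threshold of 0.5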

Anchor box shape and predicted bounding box shape
Anchor box shapes also play a significant role in generating the shape predictions. The values the YOLO CNN is trained to produce are not the object shape in and of themselves. Rather, they are scaling factors that are multiplied with the anchor box shape to produce a shape prediction. The relationship between the network outputs, let's call them t_w and t_h, and the predicted bounding box shape, b_w and b_h, is:

b_w = p_w * e^{(t_w)}
b_h = p_h * e^{(t_h)}

Where p_w and p_h are the anchor box, or prior, shape. If the ground truth shape is exactly the same as an anchor box shape, the network should be predicting 0 values for t_w and t_h.

Notice that these expressions allow bounding box predictions much larger than the grid cell size, regardless of the anchor box shapes, which answers a related question many people ask.
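A small worked example of those two equations, with a made-up anchor shape and made-up t values (widths and heights expressed as fractions of the image):

    import math

    p_w, p_h = 0.08, 0.35      # tall/skinny anchor (prior) shape, made up

    # t_w = t_h = 0 reproduces the anchor shape exactly
    print(p_w * math.exp(0.0), p_h * math.exp(0.0))    # 0.08 0.35

    # positive outputs grow the box well beyond one grid cell (1/19 ~ 0.053 of the image)
    print(p_w * math.exp(1.2), p_h * math.exp(0.5))    # ~0.266  ~0.577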

Hope this helps

Hi @ai_curious, thanks for the response.

I think I mean v2 (I believe this is the one that AN references in the lectures).

I understand now that we assign anchor boxes to each object in our ground truth, based on best IOU with the actual bounding box for that object.

I’m unclear how we assign anchor boxes to predictions. Presumably in early epochs there may not be a single clear object or bounding box defined by the network’s predictions (would these not be largely random after only one or two epochs?), so then how do we choose which anchor box to assign to a prediction to compare against the truth? Or do we just again assign based on best IOU and assume that everything will shake out after a few more iterations?

And if we are predicting scaling factors for an anchor box instead of the boundaries themselves, at what point do we choose which anchor box we are scaling? Where do we ‘input’ the anchor boxes in the prediction step, i.e. for scaling/comparison against ground truth?

All S*S*B detectors make a (set of) prediction(s) every forward pass. During training, those outputs are compared to ground truth in the loss function. During operational use, they are what they are. In both cases, there is no 'assigning'; that happens only when initially setting up the ground truth data. Each prediction vector knows which grid cell and which anchor box it corresponds to because of where it sits in the network's output matrix. As in the example above, the person object is being predicted by the (9,9,1) detector and the car by the (9,9,0) detector, so it is trivial, using Python math operators, to multiply the (9,9,1) bounding box shape prediction by the anchor_boxes[1] shape and the (9,9,0) bounding box shape prediction by the anchor_boxes[0] shape. That step is actually pretty easy to see in this week's YOLO programming exercise. Pretty sure that code fragment is in another thread in this forum. I'll look for it and paste a link below.

EDIT: it was buried in a long, not entirely parallel thread, so here it is on its own. Excerpt from the yolo_head.py helper file:

 box_wh = box_wh * anchors_tensor / conv_dims

box_wh on the righthand side is the matrix of predicted bounding box shapes, actually e^t in my equations above, which are taken directly from the author's paper. anchors_tensor is the B×2 matrix of anchor box shapes. conv_dims is a scaling factor to convert back to image-relative pixel counts, not germane to this discussion. As you can see from this code fragment, all bounding box shape predictions are scaled by their corresponding anchor box shapes. This happens for all S*S*B detectors/output locations every forward pass, regardless of whether there is an object in that position of the image or not; no 'assign' step is needed.
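The "no assign step" point is really just broadcasting. Here is a standalone NumPy sketch of the same idea (the anchor values are made up, and the shapes mirror, but are not, the exercise code):

    import numpy as np

    S, B = 19, 2
    t_wh = np.random.randn(S, S, B, 2)         # raw network shape outputs t_w, t_h
    anchors = np.array([[0.30, 0.10],          # wide/low
                        [0.08, 0.35]])         # tall/skinny

    # Broadcasting applies anchors[0] to every (row, col, 0, :) prediction and
    # anchors[1] to every (row, col, 1, :) prediction - all 722 detectors at once
    box_wh = np.exp(t_wh) * anchors            # shape (19, 19, 2, 2)

    print(box_wh.shape)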

Here is the code pulling box center location and shape out of the YOLO feature output matrix. This is what produced box_wh on the right hand side above.

box_xy = K.sigmoid(feats[..., :2])
box_wh = K.exp(feats[..., 2:4])

The predicted bounding box center locations box_xy are of shape S*S*B*2, pulled from the first 2 positions of the output features matrix. Note that no activation function is applied at the final layer in this implementation, so the sigmoid() is applied here. The predicted bounding box shapes box_wh are the next two values pulled from the output, with the exponential applied here, for symmetry I guess, rather than in the multiplication step as shown in my equations.

Ignoring the downsampling (i.e. grid cell size) scaling factor, what is going on in the code is really this:

 box_wh = anchors_tensor * K.exp(feats[..., 2:4])

which corresponds exactly to the equation in the paper. I know this is a lot to digest. Let me know if this is making sense :grinning:

I think I follow you, thanks for the detailed explanation, I appreciate it.

So in that case we are simply letting each of the SxSxB detectors generate whatever output it may, then calculating the cost(s) for the detectors that actually have an object in them (in the ground truth), and ignoring those that don't (p_obj = 0, so we don't care what else has been predicted).

Where we do have a prediction, we know what our anchor box is for the given detector. Our network has predicted some scaling factors, which we apply to the anchor box for that detector, and compare this to the ground truth.

Am I on track with the above?

If that’s the case, how come we don’t just let the algorithm predict the bounding boxes directly? I appreciate that by defining, say, 2 anchor boxes per SxS grid cell we’re saying that there can be up to 2 objects there, but can’t we just include the 2 ground truth vectors as we are doing already and just let the algorithm learn against the true bounding boxes? I don’t see how we benefit from calculating scaling factors instead?

I think we’re saying the same thing now :+1:

I don't find in any of the YOLO papers an obvious justification for the expression computing the predicted bounding box shape using the scaling factor (e^t) applied to the anchor box shape. My belief and understanding is that since the anchor box shapes are the ones that minimize average IOU error with the training data (recall they were picked using k-means), you want your shape predictions to be strongly influenced by them. If you made your shape predictions without incorporating the anchor box shapes, you'd just be throwing that prior knowledge away. Since you know the anchor box shapes are reasonably close to your ground truth shapes, use them as a baseline. This approach is a straightforward way of doing that, while training the network to produce numbers in a fairly small range centered around 0, which helps with learning and stability during training.
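One way to see the "small range centered around 0" point, with made-up ground truth and anchor shapes: compare the raw widths/heights the network would have to regress directly against the t values it has to produce once the anchors are used as a baseline.

    import math

    # Made-up ground truth widths/heights (fractions of the image) for a few objects
    gt_shapes = [(0.09, 0.42), (0.55, 0.20), (0.07, 0.30)]
    anchors   = [(0.08, 0.35), (0.30, 0.10)]   # tall/skinny, wide/low (made up)
    best      = [0, 1, 0]                      # best anchor per object by shape IOU

    # Direct regression targets: the raw shapes themselves, spread across (0, 1]
    print(gt_shapes)

    # With anchors: t = log(ground truth / matching anchor), clustered near 0
    t_values = [(math.log(w / anchors[b][0]), math.log(h / anchors[b][1]))
                for (w, h), b in zip(gt_shapes, best)]
    print([(round(tw, 2), round(th, 2)) for tw, th in t_values])
    # [(0.12, 0.18), (0.61, 0.69), (-0.13, -0.15)]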

You could give the papers a good scour or look for discussions by Mr Redmon if you really want to learn his thinking. Let us know what you find!

I should mention that there is a discussion in the v2 (YOLO9000) paper about why the center locations are predicted as grid cell offsets instead of directly predicting bounding box coordinates, and it mentions that doing so improves model stability, especially during early iterations. It talks explicitly about the center location prediction, though, not the shape prediction.

Thanks @ai_curious, this has been very helpful.

I think I was largely confused because I couldn't see any obvious "mechanical" benefits of the anchor boxes. It seems like they are mostly a way of giving the YOLO algorithm a nudge in the right direction at the start of training and of improving stability.

Appreciate the detailed answers!
