Encoding the anchor boxes

The YOLO architecture is IMAGE (m, 608, 608, 3) → DEEP CNN → ENCODING (m, 19, 19, 5, 85).
I could not understand how the encoding takes place for the image.

Let’s say I have an image with dimensions (608, 608, 3). I have a total of 2 objects in the image, for which I have two bounding boxes: [Pc1, bx1, by1, bh1, bw1, c1, c2] and [Pc2, bx2, by2, bh2, bw2, c1, c2]. Let’s say I choose to use 5 anchor boxes. How exactly do I go from image → encoding as given in the assignment?

If you look at predict(), I think you can fill in the gap.

The final output from Darknet, the convolutional backbone network for YOLO, is, as you can see:

conv2d_22 (Conv2D) (None, 19, 19, 425) 435625 ['leaky_re_lu_21[0][0]']

This is yolo_model_outputs in the following code block from predict().

yolo_model_outputs = yolo_model(image_data)
yolo_outputs = yolo_head(yolo_model_outputs, anchors, len(class_names))
out_scores, out_boxes, out_classes = yolo_eval(yolo_outputs, [image.size[1],  image.size[0]], 10, 0.3, 0.5)

Then, the next step is yolo_head, which you can find in “./yad2k/models/keras_yolo.py”. What you want is in there.

Here is an overview of the object detection/localization steps in YOLO.
Please also see this link, written by ai_curious, for the anchor-box-related operations.

The output from the network includes all candidate boxes. As you can see, the image is split into a 19x19 grid, and each grid cell has 5 anchor boxes. (The center of each anchor box lies inside the grid cell it belongs to.) Each anchor box entry has length 85: 4 position-related values + 1 confidence + an 80-way class probability distribution.
yolo_head extracts this information from the network output (19x19x425) and produces a tuple of four tensors: (box_xy, box_wh, box_confidence, box_class_probs).
Then yolo_eval, which you wrote, performs score filtering and non-max suppression to get the final boxes with their class information.
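For intuition, here is a minimal NumPy sketch (not the actual yolo_head code from yad2k) of how a (m, 19, 19, 425) output can be reshaped into the (m, 19, 19, 5, 85) encoding and split into those four tensors; the channel layout (x, y, w, h, confidence, classes) follows the description above, and all names are illustrative.

import numpy as np

num_anchors, num_classes = 5, 80
feats = np.random.rand(1, 19, 19, num_anchors * (5 + num_classes))  # stand-in for the (m, 19, 19, 425) network output

# Reshape (m, 19, 19, 425) -> (m, 19, 19, 5, 85): one 85-vector per anchor per grid cell
feats = feats.reshape(feats.shape[0], 19, 19, num_anchors, 5 + num_classes)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

box_xy          = sigmoid(feats[..., 0:2])   # box centre, as offsets within the grid cell
box_wh          = np.exp(feats[..., 2:4])    # box size, later scaled by the anchor shapes
box_confidence  = sigmoid(feats[..., 4:5])   # objectness (Pc)
exp_cls         = np.exp(feats[..., 5:] - feats[..., 5:].max(axis=-1, keepdims=True))
box_class_probs = exp_cls / exp_cls.sum(axis=-1, keepdims=True)   # softmax over the 80 classes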

I think the above covers your question. Hope this helps.

Thank you for your response, but my question was about the input to Darknet or any other architecture.

Let’s say I want to use the Darknet/ MobileNet/any other architecture as the backbone and then apply the Yolo method to get the predicted bounding boxes.

My image dimensions are (300, 300, 3), and I have a total of ‘m’ examples. That makes the input dimensions (m, 300, 300, 3).

In image 1, I have a single ground truth bounding box but in the second image, I have 3 ground truth bounding boxes.

I want to use 2 anchor boxes and a grid of dimensions (7, 7).

For a single-object localization task, I would simply pass my input image through the convnet and train it against the ground truth values of the single bounding box.
In that object localization task, my x is the image and Y_gt (Y ground truth) is [Pc bx by bh bw c1 c2 c3].

My Assumption:

But here in the object detection case, how do I get my ground truth Y based on the ground truth bounding boxes?
Do I need to assign each bounding box to the grid cell that contains its centroid, and then give each anchor the respective values (e.g., if anchor 1 belongs to a car and anchor 2 belongs to a pedestrian, and my image has a pedestrian, then anchor 1 gets Pc → 0 with the other values as don’t-cares, while anchor 2 gets Pc = 1, the bounding box centroid that falls into that grid cell, and the class)? In this way, I will have my Y_gt dimensions as [m, 7, 7, 2, 8]: a 7x7 grid, 2 anchor boxes, and 8 output values.

After doing this, my ground truth Y will match the output shape, i.e. (m, 7, 7, 2, 8).
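For concreteness, here is a rough sketch of how I imagine such a target tensor could be filled; the helper names are mine, the per-anchor layout follows my [Pc bx by bh bw c1 c2 c3] ordering, and I assume the anchor index has somehow already been chosen for each ground truth box.

import numpy as np

def build_target(gt_boxes, grid=7, num_anchors=2, num_classes=3):
    """gt_boxes: list of (bx, by, bh, bw, class_id, anchor_id), with bx, by, bh, bw in [0, 1].
    Returns a (grid, grid, num_anchors, 5 + num_classes) target tensor. Rough sketch only."""
    y = np.zeros((grid, grid, num_anchors, 5 + num_classes), dtype=np.float32)
    for bx, by, bh, bw, cls, a in gt_boxes:
        col, row = int(bx * grid), int(by * grid)   # grid cell that owns the box centroid
        y[row, col, a, 0] = 1.0                     # Pc (objectness)
        y[row, col, a, 1:5] = [bx, by, bh, bw]      # box values (could also be stored relative to the cell)
        y[row, col, a, 5 + cls] = 1.0               # one-hot class
    return y

# image 2 from above: 3 ground truth boxes
Y_gt = build_target([(0.30, 0.40, 0.50, 0.10, 1, 1),   # tall, narrow object -> anchor 1
                     (0.70, 0.60, 0.20, 0.40, 0, 0),   # short, wide object -> anchor 0
                     (0.10, 0.80, 0.20, 0.20, 2, 0)])
print(Y_gt.shape)   # (7, 7, 2, 8)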

I’m unsure if I am correct or if there is any other way to get the ground truth Y for object detection. I want to try it out myself, so any kind of help would be highly appreciated. Thank you.

I think it’s a reasonable question, since the most important step, training, is not included in this exercise. The problem is that YOLO has multiple versions and different implementations, so it is quite difficult to say definitively how it is implemented. The best way is to read some of the key papers, like v2 and v3. There are several newer versions, but those were not done by the original developer.

This exercise is based on v2, but is not identical to it. YOLO v2/v3 were implemented in C; this exercise is basically Python, with some code ported to Keras.
I will try to explain the v2/v3 implementations as much as possible for your guidance, but eventually you may need to go back to the papers.

But here in the object detection case, how do I get my ground truth Y based on the ground truth bounding boxes?

Of course, there is no ground truth at inference time, so let’s discuss training.
The most important part of training is the loss function, i.e., how the network should be trained. Basically, the loss includes:

  1. Differences of bounding (anchor) box location and size
  2. Objectness (Pc)
  3. Object class

With this loss function, the network is trained to generate the most probable anchor boxes, with an objectness score and object class, so as to minimize the above losses. The implementation differs between YOLO versions. Newer versions make the ground truth for training more anchor-box oriented, i.e., adding the “anchor box”, “grid location”, etc., so that the loss can be calculated easily.
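As a very rough sketch only (not the exact loss from any YOLO version), a simplified sum-of-squares loss combining those three terms might look like the following; the per-anchor layout is assumed to be [Pc, x, y, w, h, classes...], and the weighting factors are just illustrative.

import numpy as np

def simple_yolo_loss(y_true, y_pred, lambda_coord=5.0, lambda_noobj=0.5):
    """y_true, y_pred: (grid, grid, anchors, 5 + num_classes) tensors. Illustration only."""
    obj_mask   = y_true[..., 0:1]        # 1 where a ground truth object was assigned, else 0
    noobj_mask = 1.0 - obj_mask

    coord_loss = lambda_coord * np.sum(obj_mask * (y_true[..., 1:5] - y_pred[..., 1:5]) ** 2)  # box location/size
    obj_loss   = np.sum(obj_mask * (y_true[..., 0:1] - y_pred[..., 0:1]) ** 2)                 # objectness (Pc)
    noobj_loss = lambda_noobj * np.sum(noobj_mask * (y_true[..., 0:1] - y_pred[..., 0:1]) ** 2)
    class_loss = np.sum(obj_mask * (y_true[..., 5:] - y_pred[..., 5:]) ** 2)                   # object class

    return coord_loss + obj_loss + noobj_loss + class_loss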

Do I need to assign each bounding box to the grid cell that contains its centroid, and then give each anchor the respective values (e.g., if anchor 1 belongs to a car and anchor 2 belongs to a pedestrian, and my image has a pedestrian, then anchor 1 gets Pc → 0 with the other values as don’t-cares, while anchor 2 gets Pc = 1, the bounding box centroid that falls into that grid cell, and the class)?

This paragraph is slightly difficult to understand, but let me try one point at a time. First, what we are talking about are “anchor boxes”, which are pre-defined boxes for object detection, not “bounding boxes”.
And, as you wrote, the centroid determines which grid cell owns an anchor box. In your case, you have two anchor boxes in one grid cell, so each may detect a different object that fits the shape of its anchor box, and the two are independent. So anchor 1 catches a car with its objectness and class number, and anchor 2 catches a pedestrian with its objectness and class number.
The problem is that, in each grid cell, at most two objects can be detected. So newer versions use smaller grid cells (i.e., an increased number of grid cells), and also use outputs from multiple layers of the backbone network to cover small/mid/large objects. Here is an overview of the architecture.

(Source : Chen, Shi & Demachi, Kazuyuki. (2020). A Vision-Based Approach for Ensuring Proper Use of Personal Protective Equipment (PPE) in Decommissioning of Fukushima Daiichi Nuclear Power Station. Applied Sciences. 10. 5129. 10.3390/app10155129)

One thing I should add is that in v2 the structure is simple: we have a backbone network (Darknet) and the YOLO head. From v3 on, one more component is added, called the YOLO neck, which implements a Feature Pyramid Network. That is the picture above.
v3 uses 9 anchor boxes (v2 uses 5), and 3 anchors are assigned to each output layer.

Back to your case: you have 7x7 grid cells and 2 anchor boxes for each. In your first definition there are three classes (this is a probability distribution, so with c1, c2, c3 we have 3 classes). So the output shape from the network should be (m, 7, 7, 2x(4+1+3)) = (m, 7, 7, 16). Then, in the YOLO head or yolo_eval, you extract the per-anchor information, e.g., as (m, 7, 7, 2, 8).
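A minimal sketch of that extraction, assuming the 16 channels are laid out anchor-by-anchor with your [Pc, bx, by, bh, bw, c1, c2, c3] ordering:

import numpy as np

m, S, B, C = 4, 7, 2, 3                          # batch, grid size, anchors, classes
net_out = np.random.rand(m, S, S, B * (5 + C))   # (m, 7, 7, 16) from the network

encoded = net_out.reshape(m, S, S, B, 5 + C)     # (m, 7, 7, 2, 8): one 8-vector per anchor per cell
pc      = encoded[..., 0]                        # objectness
boxes   = encoded[..., 1:5]                      # bx, by, bh, bw
classes = encoded[..., 5:]                       # c1, c2, c3
print(encoded.shape)                             # (4, 7, 7, 2, 8)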

If you look at the loss function for this exercise, it is more complex, as it calculates IoU and does additional filtering based on Pc. But the key thing is that x, y, w, h, Pc, class info, etc. all come from the network trained with the loss function.
Also, in v3, NMS and some of the other final processing steps are not attached to the network during training; this is to focus training on the loss function itself.

Hope this helps some.

I want to try it out myself, so any kind of help would be highly appreciated.

This is a really good thing to do. v5 has a PyTorch version and v3 has a Keras version; please pick whichever version best fits your purpose and environment.

It’s important to realize that anchor boxes are not typed and aren’t assigned based on the training object class. Rather, they are assigned based on shape. During training, each ground truth bounding box is compared to the set of anchor box shapes using IOU, and the anchor box with the highest IOU is assigned. You are correct that if an image has only one object, then only one grid cell + anchor box has non-zero training data assigned. The grid cell indices are determined by the ground truth object center coordinates, and the anchor box index is based on shape.

So you should be thinking “my anchor box 0 is short and wide and anchor box 1 is tall and narrow, and my image has a pedestrian, so anchor box 1 is likely to be assigned.” Hope this helps.
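A minimal sketch of that shape-based assignment, treating the ground truth box and each anchor as if they share the same centre so that only width and height matter (the anchor sizes here are made up):

def shape_iou(wh_a, wh_b):
    # IOU of two boxes sharing the same centre, so only width/height matter
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union

anchors = [(0.40, 0.15),   # anchor 0: short and wide
           (0.15, 0.45)]   # anchor 1: tall and narrow

gt_wh = (0.12, 0.38)       # pedestrian-like ground truth box (w, h)
best_anchor = max(range(len(anchors)), key=lambda i: shape_iou(anchors[i], gt_wh))
print(best_anchor)         # 1 -> the tall, narrow anchor gets this ground truth box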


This is what my target labels look like. The shape of the target label is (m, 9, 9, 6): I use a 9x9 grid; the first value in each cell’s vector is the object probability; since I have only one class, the second value is a single class entry that I set to 1; and the last 4 values are the target bounding box (x, y, h, w) relative to the grid cell.

Next, I tried to implement the original YOLO v1 architecture (with lots of changes to fit the computing power my GPU has).

I have not yet implemented the original loss function; I first wanted to check whether the model would work, so I tried MSE as the loss and SGD as the optimizer.
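Roughly, the setup I tried looks like the following sketch; the backbone and dense layer sizes here are placeholders, not my actual architecture.

import tensorflow as tf

S, D = 9, 6   # 9x9 grid, 6 values per cell: [Pc, class, x, y, h, w]

# Placeholder backbone; the real network is a reduced YOLO v1-style CNN
backbone = tf.keras.applications.MobileNetV2(input_shape=(300, 300, 3), include_top=False, weights=None)

model = tf.keras.Sequential([
    backbone,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(S * S * D),
    tf.keras.layers.Reshape((S, S, D)),          # matches the (m, 9, 9, 6) target labels
])

model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3), loss="mse")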

Please suggest the necessary changes, if I am doing it wrong.
Thanks and regards.

Your original post asks about encoding anchor boxes, but it looks like you are not using anchor boxes. YOLO v1 supported 2 predictions per grid cell, though they didn’t call them anchor boxes. Is dropping them intentional, perhaps an intermediate step on your journey? Did you really mean to ask about encoding bounding boxes?

Well, yeah, I did mean to use anchor boxes, since increasing the number of grid cells won’t always work; even taking the grid cells as (9 x 9) can’t encode all my ground truth bounding boxes. But first I want to try things out with a single box/anchor. You’re right that in the original paper 2 predictions were made per grid cell, but they didn’t use separate boxes/anchors.

Perfectly fine approach to leave them out at first. One of my early mentors was fond of saying something like “complex systems that work almost always evolved from simple systems that worked,” and he encouraged us to build something simple and learn from it, not try interstellar travel on the first flight.

If you look again at the initial paper I think you will find that they did make two predictions per grid cell; that is what the B represents in the output shape description S*S*B*(1+4+80). They didn’t have EDA-determined shapes like the v2 and later algorithms (which, by the way, often refer to them as priors, not anchor boxes). When I started building my own YOLO implementations I used that equation explicitly and assigned values for S and B on the fly so I could drive experiments. HTH.

EDIT: @anon57530071 pointed out that for v1 they only used 20 classes, not 80, and that the network output was slightly different. The code I wrote back when I took this class uses the v2 model of S*S*B*(1+4+C), but the v1 expression is slightly different: S*S*(B*(1+4)+C). It’s always tricky describing “YOLO” because of the several subtle and not-so-subtle differences between versions.
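As a quick worked check of those two expressions, using S=7, B=2, C=20 for v1 and S=19, B=5, C=80 for the v2-style encoding in this exercise:

def v1_output_size(S, B, C):
    # v1: each cell predicts B boxes (x, y, w, h, confidence) plus one shared class distribution
    return S * S * (B * (1 + 4) + C)

def v2_output_size(S, B, C):
    # v2 style: every anchor carries its own confidence and class distribution
    return S * S * B * (1 + 4 + C)

print(v1_output_size(7, 2, 20))    # 1470   -> the 7x7x30 tensor from the v1 paper
print(v2_output_size(19, 5, 80))   # 153425 -> the 19x19x425 tensor in this exercise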