C4W3 YOLO training, anchor boxes and network's output tensor

I am still having trouble wrapping my head around some of the YOLO concepts. Below are the things I did not understand after completing the assignment.

  1. Training process

The programming assignment does not focus on training the network, since training takes time and computational resources. In the original YOLO paper, the loss function is fairly complex. What are the possible ways to implement this loss function in Keras or PyTorch?

  2. Anchor box

I downloaded the source files from the notebook. Looking at the yolo_anchors.txt file, I see the following:

0.57273, 0.677385, 1.87446, 2.06253, 3.33843, 5.47434, 7.88282, 3.52778, 9.77052, 9.16828

I assume that it follows the format anchor1_width, anchor1_height, anchor2_width, anchor2_height, .... Is this a correct assumption?

Also, does this mean that each grid cell gets these exact anchor box dimensions?

  3. Output tensor

The final output of the YOLO network is a tensor. In the original YOLO paper, there is a fully connected layer with 4096 neurons, followed by a final output tensor of shape 7x7x30. When implementing this in a framework such as TensorFlow or PyTorch, the last layers are usually fully connected layers.
So the 4096 neurons connect to 1470 (7x7x30) neurons in the final output, and a post-processing step such as a reshape is then required to convert the 1-dimensional array into a 7x7x30 tensor. Is this the correct way of doing it, or are there ways to have the final output as a tensor when defining the network architecture?

There are quite a few reference threads about YOLO created by ai_curious on the forums that you should read:

Here’s one about how Anchor Boxes are derived. That is a separate learning step from the actual training of YOLO.

Here’s one that discusses how anchor boxes are used.

Here’s one about training YOLO.

There are more such threads which a little searching can find. The Discourse search engine works pretty well.

On the question of how the output layer works, if they talk about the fully connected layer in the paper, don’t they also say how they get to the final output? If not, we have the model imported into the notebook. You can print the “summary()” of it and see what the last few layers look like.

I think it is already extant elsewhere in the forum, but here it is again…


Remember that the variables without the hats are your ground truth, and the variables with the hats, e.g. \hat{w}, are the predictions. It will need to be implemented as a custom loss function that is called on each training iteration.
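To make the custom-loss idea concrete, here is a minimal NumPy sketch of a sum-squared loss in the style of the YOLOv1 paper. It is a simplification under stated assumptions, not the exercise's code: one box per cell, a hypothetical per-cell layout of [x, y, w, h, confidence, class scores], and a confidence target of 1 for cells containing an object. The weights 5 and 0.5 are the \lambda_{coord} and \lambda_{noobj} values from the paper; the function name is made up for illustration.

```python
import numpy as np

LAMBDA_COORD = 5.0   # coordinate weight from the YOLOv1 paper
LAMBDA_NOOBJ = 0.5   # no-object confidence weight from the YOLOv1 paper

def yolo_v1_style_loss(y_true, y_pred):
    """Sum-squared loss for tensors of shape (S, S, 5 + C).

    Assumed (hypothetical) layout per cell:
    [x, y, w, h, confidence, class scores...], one box per cell.
    """
    obj = y_true[..., 4]      # 1 where the cell contains an object, else 0
    noobj = 1.0 - obj

    # Center error, counted only in cells that contain an object.
    xy_err = np.sum((y_true[..., 0:2] - y_pred[..., 0:2]) ** 2, axis=-1)

    # Square roots make errors on small boxes count more than on large ones.
    # Clipping at 0 keeps the square root defined for negative predictions.
    wh_true = np.sqrt(np.clip(y_true[..., 2:4], 0.0, None))
    wh_pred = np.sqrt(np.clip(y_pred[..., 2:4], 0.0, None))
    wh_err = np.sum((wh_true - wh_pred) ** 2, axis=-1)

    conf_err = (y_true[..., 4] - y_pred[..., 4]) ** 2
    class_err = np.sum((y_true[..., 5:] - y_pred[..., 5:]) ** 2, axis=-1)

    loss = (LAMBDA_COORD * np.sum(obj * (xy_err + wh_err))
            + np.sum(obj * conf_err)
            + LAMBDA_NOOBJ * np.sum(noobj * conf_err)
            + np.sum(obj * class_err))
    return float(loss)
```

In Keras or PyTorch the same arithmetic would be written with the framework's tensor operations so that gradients can flow through it; the structure of the terms stays the same.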

It’s either w, h or h, w; I don’t remember. Look at the exercise code that reads it.
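Either way, the flat line of ten numbers pairs up into five anchors. A small sketch of that parsing, using the line quoted from yolo_anchors.txt (the helper name `parse_anchors` is made up here, and which column is width versus height is still an assumption to check against the exercise code):

```python
import numpy as np

# The line from yolo_anchors.txt quoted in this thread.
anchors_line = ("0.57273, 0.677385, 1.87446, 2.06253, 3.33843, 5.47434, "
                "7.88282, 3.52778, 9.77052, 9.16828")

def parse_anchors(line):
    """Turn a flat comma-separated line into an (n_anchors, 2) array of pairs."""
    values = [float(v) for v in line.split(",")]
    return np.array(values).reshape(-1, 2)

anchors = parse_anchors(anchors_line)
print(anchors.shape)  # ten numbers become five pairs
```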

The number of anchor boxes and their shape(s) are fixed prior to training and shouldn’t change without retraining.

Haven’t thought it through completely, but the different prediction types use different activation functions. That is, classification and object center location use \sigma, while object shape uses exponential. To me this argues for producing the composite object from the network and then using Python to post-process the different elements. For the record, that is how it is implemented in the code for the class exercise. Welcome dissenting opinions and constructive criticism.
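As a rough illustration of that post-processing, here is a sketch that follows the box parameterization in the YOLOv2 paper: \sigma on the center offsets, exponential on the shape terms scaled by the anchor. The function and argument names are hypothetical, and the exercise code may differ in detail.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decode_box(t, cell_xy, anchor_wh):
    """Map raw outputs t = (tx, ty, tw, th) to a box center and size in grid units."""
    tx, ty, tw, th = t
    cx, cy = cell_xy      # top-left corner of the grid cell making the prediction
    pw, ph = anchor_wh    # shape of the anchor associated with this output slot
    bx = cx + sigmoid(tx) # sigmoid keeps the center inside the cell
    by = cy + sigmoid(ty)
    bw = pw * np.exp(tw)  # exponential rescales the anchor shape, always positive
    bh = ph * np.exp(th)
    return bx, by, bw, bh

# With all raw outputs at 0, the center sits mid-cell and the shape equals the anchor.
box = decode_box((0.0, 0.0, 0.0, 0.0), cell_xy=(3, 4), anchor_wh=(1.87446, 2.06253))
```

Any nonzero tw or th moves the predicted shape away from the anchor shape, which is why the anchor itself is never the prediction.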


Hi, thanks for the links.

In terms of output tensor, they mention:

Our network architecture is inspired by the GoogLeNet model for image classification [34]. Our network has 24 convolutional layers followed by 2 fully connected layers. Instead of the inception modules used by GoogLeNet, we simply use 1 × 1 reduction layers followed by 3 × 3 convolutional layers, similar to Lin et al [22].

The diagram I am referring to is this:

The programming assignment uses YOLOv3 architecture and I had a look at its final layers.

This architecture suggests that there were never any flattening or fully connected operations in place.

Following this logic, could it be that in the YOLOv1 paper, “Connected Layer” actually refers to a 1x1x4096 tensor that is connected to the 7x7x30 tensor using Conv2D layers? That would mean they are not performing flattening operations. Are my assumptions correct here?

I’m reasonably confident that the last 3 “boxes” shown in the v1 architecture are not convolutional, if that is what you’re suggesting. If you count the appearances of the abbreviations Conv. in the diagram caption, there are 24, which matches the text/narrative description. That corresponds to the first 7 “boxes”. The last layers are Conn. for fully connected. Remember, this is the v1 picture. It was changed for YOLO 9000 and v2 later that same year (2016).
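For the v1 shapes, the arithmetic does work out as a fully connected layer followed by a reshape: with S = 7 grid cells per side, B = 2 boxes per cell and C = 20 classes, the output has 7 × 7 × (2 × 5 + 20) = 1470 values. A small NumPy sketch of just the shapes, with random stand-in weights (nothing here is the trained model):

```python
import numpy as np

S, B, C = 7, 2, 20               # grid size, boxes per cell, classes (YOLOv1 paper)
out_units = S * S * (B * 5 + C)  # 7 * 7 * 30 = 1470

rng = np.random.default_rng(0)
features = rng.standard_normal(4096)               # stand-in for the 4096-unit layer output
W = rng.standard_normal((4096, out_units)) * 0.01  # hypothetical weights
b = np.zeros(out_units)

flat = features @ W + b               # fully connected layer, shape (1470,)
grid = flat.reshape(S, S, B * 5 + C)  # same values viewed as a 7x7x30 tensor
print(flat.shape, grid.shape)
```

The reshape does not have to be a separate post-processing step: frameworks let you make it part of the model itself, for example with a `Reshape` layer in Keras or a `reshape` call inside a PyTorch module's forward pass.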

Thanks for the response.

The number of anchor boxes and their shape(s) are fixed prior to training and shouldn’t change without retraining.

Now, there is this one line in yolo_anchors.txt file:

0.57273, 0.677385, 1.87446, 2.06253, 3.33843, 5.47434, 7.88282, 3.52778, 9.77052, 9.16828.

As I understand it, we divide the image into a 7x7 grid. So, considering 5 anchor boxes per grid cell, these will be the different sizes of anchor boxes assigned to each of the grid cells. Since we only know the height and width of each anchor box, its center must be at the center of the cell. Is this correct?

Also, during the training process, is the IOU calculated between these anchor boxes and the ground-truth boxes?

Further, after applying thresholds to the IOU and applying NMS, we select the anchor box as the bounding box for that object during detection. Is this correct?

Not exactly. There really isn’t a center for an anchor box, only a shape.

Kind of. IOU is used to determine which anchor box, and therefore which location in the network output, is the ‘correct’ or best one at which a prediction should be made. This is done once, up front, before training iterations start. During training, anchor box shapes are only indirectly part of the calculations. This process is described in some of the links provided above and is too extensive to repeat here.
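A minimal sketch of that up-front matching step, assuming it is done by comparing shapes only (as if the ground-truth box and every anchor shared the same center) and that each anchor pair is ordered the same way as the ground-truth shape. The helper name and the ground-truth value are hypothetical:

```python
import numpy as np

def shape_iou(wh, anchors):
    """IOU between one (w, h) shape and each anchor shape, centers aligned."""
    inter = np.minimum(wh[0], anchors[:, 0]) * np.minimum(wh[1], anchors[:, 1])
    union = wh[0] * wh[1] + anchors[:, 0] * anchors[:, 1] - inter
    return inter / union

# The five anchor shapes from yolo_anchors.txt.
anchors = np.array([[0.57273, 0.677385],
                    [1.87446, 2.06253],
                    [3.33843, 5.47434],
                    [7.88282, 3.52778],
                    [9.77052, 9.16828]])

gt_wh = np.array([2.0, 2.0])  # hypothetical ground-truth box shape, in grid units
ious = shape_iou(gt_wh, anchors)
best = int(np.argmax(ious))   # index of the anchor slot that should make this prediction
print(best, ious[best])
```

The chosen index only decides where in the output tensor the ground truth is written; the network still predicts its own shape values at that location.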

Incorrect. Anchor box shape is never selected as predicted bounding box shape. Bounding box shape is predicted as two floating point numbers mathematically related to, but not necessarily the same as, the anchor box shape associated with the location making the prediction. Again, the relationship is explained in the linked threads. HTH