Week 3 - Car Detection Anchor Boxes

anchors.txt is displayed below:


What do these numbers actually mean?

There are 5 anchor boxes, each with two dimensions: the first is the width and the second is the height, both scaled in units of the grid cell size. Note that the whole point of anchor boxes is that they are “floating”: they are used to find actual bounding boxes that have similar aspect ratios, but whose coordinates are fixed relative to the grid cell that contains the centroid of the object in question.
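To make the units concrete, here is a small sketch converting anchor dimensions from grid-cell units to pixels. The anchor values below are made up for illustration (the real ones are in anchors.txt); the 608 × 608 image and 19 × 19 grid are the geometry this assignment uses.

```python
# Illustrative only: anchor (width, height) pairs in grid-cell units.
IMAGE_SIZE = 608                        # input image is 608 x 608 pixels
GRID_SIZE = 19                          # divided into a 19 x 19 grid
CELL_PIXELS = IMAGE_SIZE / GRID_SIZE    # so each cell is 32 x 32 pixels

anchors = [(0.57, 0.68), (1.87, 2.06), (3.34, 5.47), (7.88, 3.53), (9.77, 9.17)]

for w, h in anchors:
    # An anchor of (w, h) cells corresponds to (w * 32, h * 32) pixels.
    print(f"anchor {w} x {h} cells -> {w * CELL_PIXELS:.0f} x {h * CELL_PIXELS:.0f} px")
```

So an anchor like (1.87, 2.06) describes a roughly square, cell-sized shape, while (9.77, 9.17) describes a shape covering about half the image.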

In other words, the purpose of anchor boxes is to make it easier to find plausible bounding boxes for objects in the scene. The notebook and the lectures don’t really say enough about them, but they are learned a priori through a separate training process. There are some really epic threads on the forums about this and other YOLO related subjects, here’s one specific to the anchor boxes question as a place to start.

Hope this helps. If you want to dig deeper on this subject, you can also try reading the paper on YOLO2 (aka YOLO9000). Note that the anchor boxes were added in YOLO2 and were not part of the original YOLO paper.


So YOLO1 didn’t allow multiple object detection in a single grid cell, but the introduction of anchor boxes in YOLO2 led to the algorithm gaining the ability to detect multiple objects in a single grid cell?

So these 5 anchor boxes were obtained after training? But aren’t anchor boxes hyperparameters? If you have any other resource where anchor boxes are explained properly, please mention it.

The anchor boxes are input to the training here, so I guess you could consider them as hyperparameters. But as I mentioned in my previous reply, they were learned by an earlier separate “unsupervised” training by doing K-means clustering on the shapes of the actual bounding boxes in the training set. You can find out more about this by reading the thread that I linked above.

I am not that familiar with YOLO1, but I don’t think what you said is correct. The whole point of YOLO is to be able to identify multiple objects in a scene. I think the addition of anchor boxes in YOLO2 is just a way to enable the algorithm to do a better job.

This is correct. Here is an exact quote from the original YOLO paper: “YOLO predicts multiple bounding boxes per grid cell.”
Not much room for interpretation there, though I might argue that additional components of the ‘whole point’ were A) detect those multiple objects with reasonably high accuracy and B) do it wicked fast.

Here are the other salient points from that paper regarding number of predictions per forward pass:

  1. Our system divides the input image into an S × S grid.

  2. Each grid cell predicts B bounding boxes and confidence
    scores for those boxes.

  3. Each bounding box consists of 5 predictions: x, y, w, h, and confidence.

  4. Each grid cell also predicts C conditional class probabilities, Pr(Class_i | Object). We only predict one set of class probabilities per grid cell, regardless of the number of boxes B.

The entire original YOLO model is summarized in the caption to Figure 2 in the paper as follows:

“Figure 2: The Model. Our system models detection as a regression problem. It divides the image into an S × S grid and for each grid cell predicts B bounding boxes, confidence for those boxes, and C class probabilities. These predictions are encoded as an S × S × (B ∗ 5 + C) tensor. For evaluating YOLO on PASCAL VOC, we use S = 7, B = 2. PASCAL VOC has 20 labelled classes so C = 20. Our final prediction is a 7 × 7 × 30 tensor.”
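The shape arithmetic in that caption is easy to check:

```python
# The paper's output-shape formula, checked against the VOC numbers it quotes.
def yolo_v1_output_shape(S, B, C):
    return (S, S, B * 5 + C)   # 5 = x, y, w, h, confidence per box

print(yolo_v1_output_shape(S=7, B=2, C=20))  # -> (7, 7, 30)
```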

Anchor boxes were not introduced in YOLO 9000 (aka v2) in order to support multiple-object predictions at runtime; that was baked in from the beginning. Rather, they were introduced to improve the stability and performance of the CNN during training. A clue to this purpose is the use of the word priors to describe them in the v2 paper. In my opinion, both of the 2016 papers are must-reads for anyone wishing to master this topic. Hope this helps.


I tried reading the YOLOv1 paper and am unable to understand this:

“It weights localization error equally with classification error which may not be ideal. Also, in every image many grid cells do not contain any object. This pushes the ‘confidence’ scores of those cells towards zero, often overpowering the gradient from cells that do contain objects. This can lead to model instability, causing training to diverge early on.”

The above is from page 3 of the paper, under the Training section.

Remember object detection has two parts: “What is it?” (classification) and “Where is it?” (localization). Training must minimize error in each component. The first assertion from the paper suggests that the business value of errors in these two may not have equal weight. That is, maybe an error in classification (predicting telephone pole when it is actually a walking human) is a more ‘expensive’ mistake than getting the bounding box off by a few pixels. A cost model with configurable weights would allow for tuning.


YOLO v1 uses conditional probability: the prediction confidence is conditioned on object presence. This statement is pointing out that using this approach can result in the many empty grid cell locations (all background or unlabelled objects) dominating the few non-empty grid cell locations (the ones that contain labelled ground truth bounding boxes) during training. You want the model to learn the most from the non-empty cells, but since those are fewer in number, it is similar to a standard ML class-imbalance problem: if most of the cells are empty, you can be reasonably accurate by just predicting ‘empty’ all the time.

Notice that just before and after the paragraph you quote, the authors are talking about the loss function and how to tune it for classification vs localization, object presence vs absence, and location error in large vs small boxes. Just using naive sum-squared-error minimization would result in the loss function providing the optimizer with inputs not aligned with the true learning objective.
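A minimal sketch of that weighting idea is below. The λ values of 5 and 0.5 are the ones the v1 paper reports; the paper’s full loss also takes square roots of w and h to balance large vs small boxes, which I omit here to keep the sketch short. The function and variable names are my own.

```python
import numpy as np

# Coordinate error is up-weighted and confidence error in empty cells is
# down-weighted, so the many background cells don't swamp the gradient.
LAMBDA_COORD = 5.0
LAMBDA_NOOBJ = 0.5

def weighted_sse(pred_conf, true_conf, pred_xywh, true_xywh, obj_mask):
    # obj_mask: 1 where a grid cell is responsible for an object, 0 elsewhere.
    coord = LAMBDA_COORD * np.sum(obj_mask[..., None] * (pred_xywh - true_xywh) ** 2)
    conf_obj = np.sum(obj_mask * (pred_conf - true_conf) ** 2)
    conf_noobj = LAMBDA_NOOBJ * np.sum((1 - obj_mask) * (pred_conf - true_conf) ** 2)
    return coord + conf_obj + conf_noobj
```

With this weighting, a unit of confidence error in an empty cell contributes only half as much to the loss as the same error in an occupied cell, while coordinate errors in occupied cells contribute five times as much.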


So, in the training stage, do you need labels of shape S × S × (B * 5 + C) for each image? That’s a little bit confusing to me. So I have to label each grid cell’s outputs? Thanks, and sorry if my English is bad; I am not a native speaker.

The shortest answer for all object detection algorithms is ‘yes’.

The slightly longer answer is that the training input matrix, Y, and the network output matrix, \hat{Y}, must be the same shape for the loss function to make sense of them.

For YOLO specifically, that means having training inputs for each detector location in the network output, which means accounting for all the grid cells, anchor boxes, and classes. The formula is slightly different for the original YOLO paper quoted above versus the second paper, which is the one this course was based on.

Notice that available object detection training data sets are unlikely to provide this, so you have to convert the labels to the proper grid cell + anchor box matrix location. Further, most object detection training data sets end up producing a sparse YOLO input matrix, so you need to do a lot of spatial data augmentation. Otherwise, most of these detector locations will have been trained only on 0 inputs and outputs, and so will predict that best. HTH


Thanks for the reply. I was wondering how each image is labeled, because I saw that there are programs for labeling, but they use a format like (c, h, w, x, y), where c is the class, h the height, w the width, and x and y the center. That is clearly not the format the loss function needs, so I think there must be a function or something that converts that format into one matching the shape of the model output, so the loss function can handle it. But I am not sure if that is how this works. Is that what you refer to in the last paragraph? Thanks in advance!!

The ground truth bounding box labels are created completely independently of whatever algorithms and loss functions will be applied to them later, so you will likely need to do some work for whatever application you are using. For YOLO, first you have to compute the optimal number and shape of the anchor boxes. Then, you have to compute where in the image the center of the ground truth label is, what grid cell it falls into, and which anchor box has the highest IOU. You also need to convert the coordinates and assign them to the correct grid cell + anchor box ‘detector’ location. I think I put a code fragment showing these steps into older threads you can find with search. Below are 2 that might be useful. HTH
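In the meantime, here is a hypothetical sketch of those steps: compute the owning grid cell, pick the best-matching anchor by shape IoU, and write the box into the target tensor. The helper name and tensor layout are illustrative, not course code.

```python
import numpy as np

def encode_label(box, anchors, S, C):
    # box: (cls, x, y, w, h) with x, y, w, h normalised to [0, 1] over the
    # whole image. anchors: (A, 2) array of (w, h) in the same units.
    # Returns a sparse (S, S, A, 5 + C) target tensor: one "detector"
    # slot per grid cell / anchor pair.
    cls, x, y, w, h = box
    A = len(anchors)
    y_true = np.zeros((S, S, A, 5 + C))
    col, row = int(x * S), int(y * S)            # grid cell owning the center
    # Pick the anchor whose shape best matches the box (IoU on w, h only).
    inter = np.minimum(w, anchors[:, 0]) * np.minimum(h, anchors[:, 1])
    union = w * h + anchors[:, 0] * anchors[:, 1] - inter
    best = int(np.argmax(inter / union))
    y_true[row, col, best, 0:4] = [x, y, w, h]   # box coordinates
    y_true[row, col, best, 4] = 1.0              # objectness / confidence
    y_true[row, col, best, 5 + int(cls)] = 1.0   # one-hot class
    return y_true
```

Every labeled box in an image gets written into its own (cell, anchor) slot this way; all other slots stay zero, which is exactly the sparsity mentioned above.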


Thank you, man. I was really confused about how YOLO labels work.