Questions about the YOLO Algorithm

Hi there,
I am trying to understand the YOLO algorithm but I am getting confused on certain aspects of it. I would really appreciate some clarification.

  1. Anchor boxes are chosen by inspecting the training set and identifying common shapes in which objects occur in the images. This would define the bh and bw parameters of anchor boxes. What are bx and by for anchor boxes? Are they mid-points of the grid cell?

  2. Based on my understanding, the input image is conceptually divided into a grid. Each grid cell encodes information about anchor boxes. So what does each grid cell report at the end? Does it first identify the most likely object for every anchor box in the grid cell, then check the p_c value of all the anchor boxes in the cell and report only the anchor box with the maximum p_c? This is how the visualization (not the bounding boxes) appears to be done.

  3. How are b_x and b_y computed for a test set?

Maybe I am the only one, but it feels like YOLO described in the assignment is more involved than YOLO described in lectures.


There really aren’t center points for anchor boxes. During training, you might think of them as being iteratively floated to the center of each ground truth bounding box and compared with it using IOU to help learn which anchor box shape is best for each training set object. The ‘assignment’ of the single anchor box that has the highest IOU with the training object is then encoded into the learned parameters.
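
To make the 'floated to the center' idea concrete, here is a minimal sketch of shape-only IOU matching. It assumes boxes and anchors are given as (width, height) pairs on the same scale; the function names and the example anchor values are made up for illustration and are not taken from any YOLO codebase.

```python
def shape_iou(box_wh, anchor_wh):
    """IOU of two boxes that share the same center, so only width and height matter."""
    bw, bh = box_wh
    aw, ah = anchor_wh
    inter = min(bw, aw) * min(bh, ah)
    union = bw * bh + aw * ah - inter
    return inter / union


def best_anchor(box_wh, anchors):
    """Index of the anchor shape with the highest IOU against one ground truth box."""
    return max(range(len(anchors)), key=lambda i: shape_iou(box_wh, anchors[i]))


# Hypothetical anchor shapes (width, height) as fractions of the image size.
anchors = [(0.1, 0.3), (0.4, 0.2), (0.8, 0.8)]

# A tall, narrow ground truth box is matched to the tall, narrow anchor.
print(best_anchor((0.12, 0.28), anchors))
```

Because both boxes are treated as co-centered, no bx or by is ever needed for the anchor itself.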

After forward propagation, for each grid cell there is a vector of predicted values. For each anchor box, the vector includes an object presence prediction (p_c), the coordinate positions (one each for bx, by, bh, bw), and class predictions. If the objects in the forward prop input resemble a training object, the presence and class predictions will be high; otherwise they will be low. The list of candidates is then subject to thresholding and duplicate disambiguation in code that is part of YOLO but outside of, and downstream from, the CNN.
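
Here is a rough sketch of that downstream step, not the actual YOLO code. It assumes each candidate has already been reduced to a (score, class_id, box) tuple with the box in corner form, and that the score is p_c times the highest class probability, which is a common convention but an assumption here. The thresholds and function names are illustrative only.

```python
def score_candidate(p_c, class_probs):
    """Combine presence and class predictions into (score, class_id)."""
    best = max(range(len(class_probs)), key=lambda i: class_probs[i])
    return p_c * class_probs[best], best


def iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2) corners."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def filter_and_suppress(candidates, score_threshold=0.6, iou_threshold=0.5):
    """Thresholding followed by per-class non-max suppression."""
    # Step 1: drop low-confidence candidates.
    kept = [c for c in candidates if c[0] >= score_threshold]
    # Step 2: visit survivors from highest to lowest score and keep one only
    # if it does not heavily overlap an already kept box of the same class.
    kept.sort(key=lambda c: c[0], reverse=True)
    result = []
    for cand in kept:
        if all(cand[1] != r[1] or iou(cand[2], r[2]) < iou_threshold
               for r in result):
            result.append(cand)
    return result


candidates = [
    (0.9, 0, (10, 10, 50, 50)),
    (0.8, 0, (12, 12, 52, 52)),      # near-duplicate of the first box
    (0.7, 1, (100, 100, 150, 160)),
    (0.3, 1, (200, 200, 220, 220)),  # below the score threshold
]
detections = filter_and_suppress(candidates)
print(detections)
```

Note that nothing here picks 'the one anchor box with max p_c' per grid cell. Every anchor box in every cell is a candidate, and only the thresholds decide what survives.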

Just ground truth bounding box corner locations scaled by image size. Maybe also scaled by grid cell size. There is flexibility in where the conversion between image-relative and grid cell-relative coordinates takes place.
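
As one possible illustration of that conversion, the sketch below takes a labeled box in pixel corner form and produces a grid cell index plus cell-relative center offsets. The 608x608 image size, the 19x19 grid, and the function name are assumptions for the example; as noted above, where this conversion happens is a design choice.

```python
def encode_box(x1, y1, x2, y2, img_w, img_h, grid=19):
    """Convert pixel corners to (row, col, bx, by, bw, bh)."""
    # Center and size as fractions of the image dimensions.
    cx = (x1 + x2) / 2 / img_w
    cy = (y1 + y2) / 2 / img_h
    bw = (x2 - x1) / img_w
    bh = (y2 - y1) / img_h
    # Grid cell that contains the box center (clamped to the last cell).
    col = min(int(cx * grid), grid - 1)
    row = min(int(cy * grid), grid - 1)
    # Offset of the center inside that cell, between 0 and 1.
    bx = cx * grid - col
    by = cy * grid - row
    return row, col, bx, by, bw, bh


# A 64x64 pixel box in a 608x608 image (each grid cell is then 32 pixels wide).
row, col, bx, by, bw, bh = encode_box(100, 200, 164, 264, 608, 608)
print(row, col, round(bx, 3), round(by, 3))
```

In this example the box center is at pixel (132, 232), which falls in the cell at row 7, column 4, an eighth of the way across the cell and a quarter of the way down.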

I really do not understand what you said here. The mechanics of the algorithm seem to confuse me with what the algorithm is actually doing. When a user is labeling data for grid cells and adding ground truth data for each anchor box, what are the values being provided for bx and by?

I assume that bh and bw for the anchor box 1 stay the same for all grid cells. Similarly bh and bw for anchor box 2.

There are no bx, by values for anchor boxes; they have no location, only shape. It is also probably not correct to say ‘ground truth data for each anchor box.’ Ground truth is the true positions of the labelled objects in the training set, that is, their bounding boxes. Anchor boxes are the set of shapes determined as the K-means centroids that minimize overall area error with the bounding boxes. They aren’t themselves bounding boxes.

This is correct. The number and shapes of the anchor boxes are determined during exploratory data analysis of the data set. Once determined, they are applied consistently.

Anchor boxes are not an easy concept to understand.

One way to look at it is that after all of the training data has been labeled, you take the sizes of all of the bounding boxes that were used in the labels, and select the five most commonly used sizes.

These are the anchor boxes. They are box sizes that the algorithm should give priority to using.

Here is an additional reference:

Sorting based on training set occurrences might be one approach, but it isn’t what the YOLO inventors did; they ran a K-means clustering analysis. I don’t believe you are guaranteed that K-means cluster centroids represent any actual data set member. (Anchors were introduced in the second YOLO paper, from 2016, where they were called ‘priors’.)

It also seems like the simple sorting approach could suffer from imbalance in the training set. It could give good predictions on one group of similar shapes (say the top 5 occurrences all relate to nearby motor vehicles) but do poorly on others, such as classes with different aspect ratios (traffic signs or humans) or less common sizes (the same class at a different distance). One can also imagine a data set in which only 5 shapes occur more than once, yet those 5 are grossly different from the size and shape of the vast majority of detection targets, all of which happen to have unique shapes in the data set. K-means with IOU helps protect against both problems.
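
For anyone who wants to experiment, here is a small sketch of K-means over (width, height) pairs using 1 - IOU as the distance. It is not the original authors' code. The initialization scheme, the use of the mean for the centroid update, and the toy data are all assumptions made for this example.

```python
def shape_iou(a, b):
    """IOU of two (width, height) shapes placed on a common center."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_anchors(shapes, k, iterations=50):
    """Cluster box shapes; the nearest centroid is the one with the highest IOU."""
    # Simple deterministic start (an assumption): evenly spaced picks
    # from the shapes sorted by area.
    ordered = sorted(shapes, key=lambda s: s[0] * s[1])
    step = len(ordered) / k
    centroids = [ordered[int((i + 0.5) * step)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for s in shapes:
            nearest = max(range(k), key=lambda i: shape_iou(s, centroids[i]))
            clusters[nearest].append(s)
        new_centroids = []
        for members, old in zip(clusters, centroids):
            if not members:
                new_centroids.append(old)  # keep the centroid of an empty cluster
                continue
            n = len(members)
            new_centroids.append((sum(m[0] for m in members) / n,
                                  sum(m[1] for m in members) / n))
        if new_centroids == centroids:
            break  # assignments have stopped changing
        centroids = new_centroids
    return centroids


# Toy data: three tall boxes and three wide boxes, as fractions of image size.
shapes = [(0.10, 0.30), (0.12, 0.34), (0.11, 0.32),
          (0.50, 0.20), (0.54, 0.22), (0.52, 0.18)]
anchors = kmeans_anchors(shapes, k=2)
print(anchors)
```

On this toy data the wide cluster's centroid comes out near (0.52, 0.20), a shape that does not appear anywhere in the input, which illustrates the point that a centroid need not be an actual data set member.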

EDIT - not sure that anyone reads these old threads, but if you’re here and want to understand more about how the YOLO inventors decided on which anchor boxes to use, and quantitatively why their approach is superior to just selecting the most common shapes in the training data, take a look here…[Deriving YOLO anchor boxes]

Like most things in machine learning, I don’t think there is a single, simple, universally applicable answer. It requires engineering tradeoffs based on the data set, the runtime environment, and the business problem or domain. The ‘correct’ answer for a self-driving vehicle probably won’t be the same as for subject identification on a mobile device camera, etc.