Hi there,
I am trying to understand the YOLO algorithm but I am getting confused on certain aspects of it. I would really appreciate some clarification.

1. Anchor boxes are chosen by inspecting the training set and identifying common shapes in which objects occur in the images. This would define the bh and bw parameters of anchor boxes. What are bx and by for anchor boxes? Are they mid-points of the grid cell?

2. Based on my understanding, the input image is conceptually divided into a grid, and each grid cell encodes information about anchor boxes. So what does each grid cell report at the end? Does it first identify the most likely object for every anchor box in the cell, then check the maximum p_c value across the anchor boxes and report only the anchor box with the highest p_c? That is how the visualization (not the bounding boxes) appears to be done.

3. How are b_x and b_y computed for a test set?

Maybe I am the only one, but it feels like YOLO described in the assignment is more involved than YOLO described in lectures.

Regards,
Dinesh

There really aren’t center points for anchor boxes. During training, you might think of them as being iteratively floated to the center of each ground truth bounding box and compared with it using IOU to help learn which anchor box shape is best for each training set object. The ‘assignment’ of the single anchor box that has the highest IOU with the training object is then encoded into the learned parameters.
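The shape-only IOU comparison described above can be sketched as follows. This is a minimal illustration, not the assignment's code; the function names `iou_wh` and `best_anchor` are made up here, and both boxes are imagined as centered on the same point so that only width and height matter.

```python
def iou_wh(box_wh, anchor_wh):
    """IOU between a ground-truth box shape and an anchor shape.

    Both are treated as centered at the same point, so only the
    (width, height) pairs matter -- no bx, by is involved.
    """
    w1, h1 = box_wh
    w2, h2 = anchor_wh
    inter = min(w1, w2) * min(h1, h2)
    union = w1 * h1 + w2 * h2 - inter
    return inter / union

def best_anchor(box_wh, anchors):
    """Index of the anchor whose shape has the highest IOU with the box."""
    return max(range(len(anchors)), key=lambda i: iou_wh(box_wh, anchors[i]))
```

During label encoding, each training object is then assigned to the single anchor returned by `best_anchor` in the grid cell containing the object's center.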

After forward propagation, for each grid cell there is a vector of predicted values. The vector includes the coordinate positions (one each for bx, by, bh, bw) and class predictions per anchor box. If the objects in the forward prop input resemble a training object, these presence and class predictions will be high, otherwise they will be low. The list of candidates is then subject to thresholding and duplicate disambiguation in code that is part of YOLO but outside of/downstream from the CNN.
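The thresholding step downstream of the CNN can be sketched like this. The tensor layout `(S, S, A, 5 + C)` with the last axis ordered `(p_c, bx, by, bh, bw, class scores...)`, the function name `filter_predictions`, and the threshold value are all illustrative assumptions, not the assignment's exact API.

```python
import numpy as np

def filter_predictions(preds, threshold=0.6):
    """Keep only confident detections from a YOLO-style output tensor.

    preds: array of shape (S, S, A, 5 + C), where the last axis holds
    (p_c, bx, by, bh, bw, class scores...).
    Returns a list of (grid_row, grid_col, anchor, score, class_index).
    """
    S, _, A, _ = preds.shape
    kept = []
    for row in range(S):
        for col in range(S):
            for a in range(A):
                p_c = preds[row, col, a, 0]
                class_scores = preds[row, col, a, 5:]
                c = int(np.argmax(class_scores))
                # box confidence times class probability
                score = p_c * class_scores[c]
                if score > threshold:
                    kept.append((row, col, a, float(score), c))
    return kept
```

The surviving candidates would then go through non-max suppression to remove duplicate detections of the same object.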

Just ground truth bounding box corner locations scaled by image size. Maybe also scaled by grid cell size. There is flexibility in where the conversion between image-relative and grid cell-relative coordinates takes place.
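One possible version of that conversion, for a box center, is sketched below. The helper name `to_cell_coords` and the convention (bx, by in [0, 1) relative to the cell's top-left corner) are assumptions for illustration; as noted above, implementations differ in where this conversion happens.

```python
def to_cell_coords(x_center, y_center, img_w, img_h, S):
    """Convert a box center from pixel coordinates to grid-cell-relative values.

    S is the number of grid cells per side. Returns the cell indices
    (row, col) and the offsets bx, by in [0, 1) within that cell.
    """
    # scale to [0, 1) relative to the whole image
    x = x_center / img_w
    y = y_center / img_h
    # which cell the center falls in
    col = int(x * S)
    row = int(y * S)
    # offset within that cell
    bx = x * S - col
    by = y * S - row
    return row, col, bx, by
```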

I really do not understand what you said here. I seem to be confusing the mechanics of the algorithm with what it is actually doing. When a user is labeling data for grid cells and adding ground truth data for each anchor box, what are the values being provided for bx and by?

I assume that bh and bw for anchor box 1 stay the same across all grid cells, and similarly for anchor box 2.

There are no values bx, by for anchor boxes; they have no location, only shape. And probably not correct to say ‘ground truth data for each anchor box.’ Ground truth is the true positions of the labelled objects in the training set, their bounding boxes. Anchor boxes are the set of shapes determined as the K-means centroids that minimize overall area error with the bounding boxes. They aren’t themselves bounding boxes.
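The K-means idea above can be sketched roughly as follows. For brevity this uses plain Euclidean distance on (width, height) pairs, whereas YOLO papers typically cluster with 1 - IOU as the distance; the function name and the deterministic initialization are choices made here for illustration.

```python
def kmeans_anchors(box_shapes, k, iters=20):
    """Cluster ground-truth (width, height) pairs into k anchor shapes.

    Plain k-means with Euclidean distance for brevity; the YOLO papers
    use 1 - IOU as the distance instead. Initialized from the first k
    boxes to keep this sketch deterministic.
    """
    centroids = list(box_shapes[:k])
    for _ in range(iters):
        # assign every box shape to its nearest centroid
        clusters = [[] for _ in range(k)]
        for w, h in box_shapes:
            i = min(range(k),
                    key=lambda j: (w - centroids[j][0]) ** 2
                                + (h - centroids[j][1]) ** 2)
            clusters[i].append((w, h))
        # move each centroid to the mean of its cluster
        centroids = [
            (sum(w for w, _ in c) / len(c), sum(h for _, h in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids
```

The resulting centroids are the anchor shapes; each has a width and height but no position.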

This is correct. The number and shapes of the anchor boxes are determined during exploratory data analysis on the data set. Once determined, they are applied consistently to every grid cell.

Anchor boxes are not an easy concept to understand.

One way to look at it is that after all of the training data has been labeled, you take the sizes of all of the bounding boxes that were used in the labels, and select the five most commonly used sizes.

These are the anchor boxes. They are box sizes that the algorithm should give priority to using.