Quick question regarding YOLO algorithm

Hi everyone,
I’m trying to understand YOLO’s label vector (y = [pc, bx, by, bh, bw, c1, c2, c3]) for the ground truth data y.

In this image example used in Dr. Andrew’s course, if we look at the middle square of the bottom row, pc = 1 because it contains the midpoints of both the car and the person.

But if we look at the bottom left corner, is pc = 0? (Although it contains a large proportion of the car, it doesn’t contain the midpoint of the car, so do we treat it as containing no object?) And do all the other values (bx, by, …, c1, c2, c3) of this bottom left box then become ‘don’t care’?

So in summary, for the labels (the ground truth values that we train on), only the box that includes the midpoint of an object will have pc = 1, and all other boxes that don’t contain that midpoint (even if a large part of the object is present in that box, e.g., a large proportion of the car is in the bottom left box) will still have pc = 0?

An additional question concerns objects of the same class with different sizes (e.g., a big nearby car and a small distant car). It’s not mentioned in the video, but does YOLO do anything special to handle object size? For example, a large anchor box designed for large nearby cars might be too big for small distant cars. Will the YOLO algorithm break in this situation?



As part of its approach to object localization, the YOLO network predicts the coordinates of the object center. In order to do this, like any machine learning algorithm, it must first be trained on examples. The examples tell it which locations contain an object center and which locations do not. The examples also contain the shape of the object centered there, including whether that shape is larger than one grid cell, which is how it learns about the parts of the objects in the lower left of this training image.
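As a rough sketch of how a labelled center decides which single grid cell gets p_c = 1, assume the object center is given as fractions of the image width and height. The function name `center_to_cell` and the example coordinates below are made up for illustration; they are not from the course.

```python
def center_to_cell(x_center, y_center, grid_w=3, grid_h=3):
    """Map a normalized object center (0..1 across the image) to a grid cell.

    Returns (c_x, c_y, b_x, b_y): the column and row of the cell that owns
    the object, plus the center's offset inside that cell (each in 0..1).
    """
    # Scale to grid units, e.g. 0.5 * 3 = 1.5 means halfway through column 1.
    gx = x_center * grid_w
    gy = y_center * grid_h
    # Clamp so a center exactly on the right/bottom edge stays in the last cell.
    c_x = min(int(gx), grid_w - 1)
    c_y = min(int(gy), grid_h - 1)
    return c_x, c_y, gx - c_x, gy - c_y


# Hypothetical center: horizontally mid-image, inside the bottom row.
c_x, c_y, b_x, b_y = center_to_cell(0.5, 0.75)
```

Only the cell returned here is marked with p_c = 1. Neighbouring cells that the object merely overlaps, such as the bottom left one, keep p_c = 0; the object's extent is carried by the width and height stored in the owning cell.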

Anchor boxes do not have a type. They only have a shape. So there is no ‘car’ anchor box or ‘person’ anchor box, only a wider-than-it-is-tall anchor box and a taller-than-it-is-wide anchor box (assuming those shapes are prevalent in the training data). There are also no ‘close’ or ‘far away’ anchor boxes, just larger or smaller ones. So if the training data has lots of close cars and lots of far away cars, there may be both large and small anchor boxes related to objects labelled as ‘car’ in the training data.

Speaking of different anchor box shapes, having multiple anchor boxes with different shapes is what would allow a YOLO v2 network to identify both the person and the car in this image. In the training data two locations would have a 1 for object presence. Both would have the same grid cell indices and object center coordinates, but they would have different anchor box indices and different bounding box shapes.
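One common way to choose the anchor box index for a ground truth box is to compare shapes only (as if both boxes shared a center) and take the anchor with the highest IoU. The sketch below assumes two hypothetical anchor shapes in grid-cell units and rough box shapes eyeballed for a tall person and a wide car; the helper names `shape_iou` and `best_anchor` are illustrative, not from any library.

```python
def shape_iou(shape_a, shape_b):
    """IoU of two boxes compared by (width, height) only, as if they shared a center."""
    inter = min(shape_a[0], shape_b[0]) * min(shape_a[1], shape_b[1])
    union = shape_a[0] * shape_a[1] + shape_b[0] * shape_b[1] - inter
    return inter / union


def best_anchor(box_shape, anchors):
    """Index of the anchor whose shape overlaps the box shape the most."""
    return max(range(len(anchors)), key=lambda i: shape_iou(box_shape, anchors[i]))


# Hypothetical anchor shapes as (width, height) in grid-cell units.
anchors = [(1.0, 2.0), (2.0, 1.0)]  # 0: taller-than-wide, 1: wider-than-tall

person_anchor = best_anchor((0.9, 1.9), anchors)  # tall, narrow box
car_anchor = best_anchor((2.5, 1.1), anchors)     # wide, short box
```

With these assumed shapes the person lands on the tall anchor and the car on the wide one, which is why two objects with the same center can coexist in one grid cell.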


Here is a little more detail mapped specifically to the image in the OP. This grid is 3x3, with what appears to be 2 anchor boxes (one taller-than-wide, one wider-than-tall). The YOLO-independent ground truth would have 2 labels, one for the person and one for the car. When converted to YOLO input for training, there would need to be 3 * 3 * 2 == 18 locations, each holding (1 + 4 + C) == 8 values: one for p_c, 2 each for bounding box location and shape, and C == 3 class indicators (as a one-hot vector). NOTE: the ground truth and the CNN output need to be the same shape in order to compare them in the loss function using a vectorized implementation. You want to just write confidence_loss = \hat{p_c} - p_c and have Python array arithmetic work element-wise.
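To make the same-shape point concrete, here is a minimal NumPy sketch. The prediction array is just random numbers standing in for a CNN output, and the plain difference is the notional comparison from the paragraph above, not the actual YOLO loss.

```python
import numpy as np

S, B, C = 3, 2, 3  # grid size, anchor boxes per cell, classes (values from the example)

# Ground truth and prediction share the shape (S, S, B, 1 + 4 + C).
y_true = np.zeros((S, S, B, 1 + 4 + C))
y_true[1, 2, :, 0] = 1.0  # both anchors in the bottom-center cell hold an object

# Stand-in for the CNN output: random numbers in [0, 1), NOT a real prediction.
rng = np.random.default_rng(0)
y_pred = rng.random(y_true.shape)

# Element-wise difference of the p_c channel over all 18 locations at once.
confidence_diff = y_pred[..., 0] - y_true[..., 0]
```

Because both arrays have identical shapes, no loops over grid cells or anchors are needed; the subtraction covers every location in one expression.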

For this image, 16 of the 18 locations will have 0 for all 8 values. 2 locations will have non-zero values; the locations corresponding to the center grid location on the lowest row. That is, c_x = 1 and c_y = 2. Both of these locations will have p_c = 1 because there is an object present. Both of these locations will have the same values for b_x and b_y because the center of the person and the center of the car ground truth labels are colocated. One of these locations will have a C vector indicating car and the other will have a C vector indicating person. Say [0, 1, 0] and [1, 0, 0] (depends on the class index). Finally, each location will have different values for b_w and b_h to capture the different bounding box shapes of the two objects. From eyeballing the image, the location with the car record would have a b_w indicating a bounding box width of about 2.5 x grid cell width and a b_h of about 1.1 x grid cell height. The location with the person record would have a b_w of about 0.9 x grid cell width and a b_h of about 1.9 x grid cell height.

Notionally, you have something like this:

import numpy as np

num_grid_cells_wide = 3
num_grid_cells_high = 3
num_anchor_boxes = 2
num_classes = 3

# initialize to 0; np.zeros takes the shape as a single tuple
ground_truth = np.zeros((num_grid_cells_wide, num_grid_cells_high, num_anchor_boxes, 1 + 4 + num_classes))

# write values for locations that actually have data
# layout per location: [p_c, b_x, b_y, b_w, b_h, c1, c2, c3]
ground_truth[1, 2, 0] = [1, 0.5, 0.2, 0.9, 1.9, 1, 0, 0]  # person
ground_truth[1, 2, 1] = [1, 0.5, 0.2, 2.5, 1.1, 0, 1, 0]  # car


  • The 18 locations in the ground truth come from (S*S*B) with S=3 and B=2
  • 16 locations all zeros, 2 locations non-zeros
  • Both non-zero locations are in the same grid cell, c_x=1, c_y=2
  • Both non-zero locations have the same value for p_c (because a GT object is present, p_c = 1; the output of the CNN, \hat{p_c}, will be some value 0 <= \hat{p_c} <= 1)
  • Both non-zero locations have the same values for b_x and b_y (because in this image the labelled objects happen to have colocated centers)
  • One non-zero location has the shape and class indicator for the person object
  • The other non-zero location has the shape and class indicator for the car object
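
The points in this summary can be checked directly against the array. This is a small self-contained sketch that rebuilds the notional ground truth with the same illustrative values and then counts locations:

```python
import numpy as np

# (cells wide, cells high, anchors, 1 + 4 + classes)
ground_truth = np.zeros((3, 3, 2, 8))
ground_truth[1, 2, 0] = [1, 0.5, 0.2, 0.9, 1.9, 1, 0, 0]  # person
ground_truth[1, 2, 1] = [1, 0.5, 0.2, 2.5, 1.1, 0, 1, 0]  # car

# Total number of locations: S * S * B.
num_locations = ground_truth[..., 0].size
# Locations whose 8 values are all zero.
num_empty = int((~ground_truth.any(axis=-1)).sum())
# Rows of (c_x, c_y, anchor index) where p_c == 1.
occupied = np.argwhere(ground_truth[..., 0] == 1)
# True when both objects share the same b_x and b_y.
same_center = np.array_equal(ground_truth[1, 2, 0, 1:3], ground_truth[1, 2, 1, 1:3])
```

Here `occupied` lists the two anchor slots of the single grid cell c_x=1, c_y=2, and every other location stays at zero.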