Here is a little more detail mapped specifically to the image in the OP. The grid is 3x3, with what appear to be 2 anchor boxes (one taller-than-wide, one wider-than-tall). The YOLO-independent ground truth would have 2 labels: one for the person, one for the car. When converted to YOLO input for training, there would need to be 3 * 3 * 2 == 18 locations, each holding (1 + 4 + C) == 8 values: one for p_c, 2 for the bounding box center (b_x, b_y), 2 for its shape (b_w, b_h), and C class indicators (as a one-hot vector). NOTE: the ground truth and the CNN output need to have the same shape in order to compare them in the loss function using a vectorized implementation. You want to be able to just write confidence_loss = \hat{p_c} - p_c and have Python matrix algebra do the work.
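As a sketch of that shape-matching requirement (the array names and the random stand-in for the network output are illustrative only, not from any particular YOLO implementation):

```
import numpy as np

S, B, C = 3, 2, 3                    # grid size, anchor boxes, classes
shape = (S, S, B, 1 + 4 + C)         # ground truth and CNN output share this shape

ground_truth = np.zeros(shape)       # [p_c, b_x, b_y, b_w, b_h, one-hot classes]
cnn_output = np.random.rand(*shape)  # stand-in for the network's prediction

# Because the shapes match, the loss terms vectorize directly:
p_c = ground_truth[..., 0]
p_c_hat = cnn_output[..., 0]
confidence_loss = p_c_hat - p_c      # elementwise over all S*S*B locations
```

No explicit loops over grid cells or anchors are needed; NumPy broadcasting handles all 18 locations at once.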

For this image, 16 of the 18 locations will have 0 for all 8 values. The 2 remaining locations, the ones corresponding to the center grid cell of the lowest row (that is, c_x = 1 and c_y = 2), will have non-zero values. Both of these locations will have p_c = 1 because an object is present. Both will have the same values for b_x and b_y because the centers of the person and car ground truth labels are colocated. One will have a C vector indicating *car* and the other a C vector indicating *person*, say [0, 1, 0] and [1, 0, 0] (depending on the class indices). Finally, each location will have different values for b_w and b_h to capture the different bounding box shapes of the two objects. Eyeballing the image, the car record would have a b_w of about 2.5 grid-cell widths and a b_h of about 1.1 grid-cell heights; the person record would have a b_w of about 0.9 grid-cell widths and a b_h of about 1.9 grid-cell heights.
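Going from a pixel-space label to that encoding can be sketched like this. The image size, the helper name, and the convention of offsets within the cell plus sizes in grid-cell units are assumptions chosen to match the numbers above; real YOLO variants differ in the details:

```
def encode_box(cx_px, cy_px, w_px, h_px, img_w=300, img_h=300, S=3):
    """Convert a pixel-space box (center + size) into grid encoding:
    cell indices plus center offsets and sizes in grid-cell units."""
    cell_w, cell_h = img_w / S, img_h / S
    c_x, c_y = int(cx_px // cell_w), int(cy_px // cell_h)  # which cell owns the center
    b_x = cx_px / cell_w - c_x  # center offset within the cell, in [0, 1)
    b_y = cy_px / cell_h - c_y
    b_w = w_px / cell_w         # width/height in multiples of the cell size
    b_h = h_px / cell_h
    return c_x, c_y, b_x, b_y, b_w, b_h

# A made-up pixel box that lands on roughly the eyeballed car numbers above:
car = encode_box(150, 260, 250, 110)
```

With a 300x300 image and S = 3, a 250x110-pixel box centered at (150, 260) lands in cell (1, 2) with b_w = 2.5 and b_h = 1.1, matching the walkthrough.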

Notionally, you have something like this:

```
import numpy as np

num_grid_cells_wide = 3
num_grid_cells_high = 3
num_anchor_boxes = 2
num_classes = 3

# initialize to 0; note np.zeros takes the shape as a single tuple
ground_truth = np.zeros((num_grid_cells_wide, num_grid_cells_high,
                         num_anchor_boxes, 1 + 4 + num_classes))

# write values only for the locations that actually have data
# layout per location: [p_c, b_x, b_y, b_w, b_h, class one-hot...]
ground_truth[1, 2, 0] = [1, 0.5, 0.2, 0.9, 1.9, 1, 0, 0]  # person
ground_truth[1, 2, 1] = [1, 0.5, 0.2, 2.5, 1.1, 0, 1, 0]  # car
```
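A quick sanity check on that tensor confirms the 16-zeros / 2-non-zeros claim (rebuilding the same array here so the snippet stands alone):

```
import numpy as np

ground_truth = np.zeros((3, 3, 2, 8))
ground_truth[1, 2, 0] = [1, 0.5, 0.2, 0.9, 1.9, 1, 0, 0]  # person
ground_truth[1, 2, 1] = [1, 0.5, 0.2, 2.5, 1.1, 0, 1, 0]  # car

# A location "has data" if any of its 8 values is non-zero
nonzero = np.any(ground_truth != 0, axis=-1)
print(nonzero.sum())     # 2
print((~nonzero).sum())  # 16
```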

Recap:

- 18 locations in the ground truth come from (S*S*B) with S=3 and B=2
- 16 locations all zeros, 2 locations non-zeros
- Both non-zero locations are in the same grid cell, c_x=1, c_y=2
- Both non-zero locations have the same value for p_c: the ground truth has p_c = 1 because a GT object is present, while the CNN output \hat{p_c} will be some value 0 < \hat{p_c} <= 1
- Both non-zero locations have the same values for b_x and b_y (because in this image the labelled objects happen to have colocated centers)
- One non-zero location has the shape and class indicator for the *person* object
- The other non-zero location has the shape and class indicator for the *car* object
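The 16 all-zero locations also show why p_c doubles as a mask in the loss: the localization and class terms should only count where a GT object exists. A simplified, hedged sketch (plain squared error with a deterministic dummy prediction; the real YOLO loss adds term weights and sqrt-scaling of the box sizes):

```
import numpy as np

S, B, C = 3, 2, 3
ground_truth = np.zeros((S, S, B, 1 + 4 + C))
ground_truth[1, 2, 0] = [1, 0.5, 0.2, 0.9, 1.9, 1, 0, 0]  # person
ground_truth[1, 2, 1] = [1, 0.5, 0.2, 2.5, 1.1, 0, 1, 0]  # car

prediction = np.ones((S, S, B, 1 + 4 + C))  # dummy CNN output for illustration

obj_mask = ground_truth[..., 0]  # 1 where a GT object lives, else 0

# Localization error counted only where an object exists; the mask
# zeroes out the 16 background locations before summing
loc_err = np.sum(obj_mask[..., None]
                 * (prediction[..., 1:5] - ground_truth[..., 1:5]) ** 2)
```

Only the two labelled locations contribute to `loc_err`; everything else is multiplied away by the mask.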