Deep dive on the YOLO v2 localization predictions

YOLO v2 makes three types of predictions for each image region: is there an object there at all, what is it, and where is it? This thread examines the last of these, the localization predictions.

There are five sets of numbers that play a role in these predictions, covering everything from converting third-party training data to visualizing predicted bounding boxes on an image. There is some flexibility in which set of numbers is used where, but they all have to be aligned and used consistently. Here are the players:

  1. t_* : the set of four numbers output by the neural net: two for the predicted object center, two for the predicted object shape. These are whatever floating point values the network happens to output.
  2. c_* : the pair of numbers that give the 0-based index of the grid cell containing the object center. They are determined by which cell the center falls in, i.e. by scaling the center location by the ratio of grid size to image size and taking the floor, not by the t_* outputs themselves. See example below.
  3. b_* : four numbers that comprise a bounding box location, two for the object center (x, y) and two for the object shape (w, h). Related to t_* through the equations below. b_x and b_y locate the center across the whole image grid (cell index plus fractional offset); b_w and b_h are measured in grid-cell units.
  4. box_* : four numbers that comprise the predicted bounding box location, converted from the center + shape format to two (x, y) corner pairs
  5. p_* : the shapes of the anchor boxes, determined using K-means clustering

Here are the equations relating the first three groups (the anchor shapes p_* also appear):
b_x = \sigma(t_x) + c_x
b_y = \sigma(t_y) + c_y
b_w = p_w e^{t_w}
b_h = p_h e^{t_h}
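
To make these pieces concrete, here is a minimal numpy sketch of the forward direction: decoding one cell's raw t_* outputs into a b_* box and then into box_* corners. It assumes everything is in grid-cell units (including the anchor shapes), and all names are illustrative rather than taken from any particular implementation.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def decode_prediction(t, cell_xy, anchor_wh):
        """t = (t_x, t_y, t_w, t_h): raw network outputs for one anchor in one cell.
        cell_xy = (c_x, c_y): 0-based grid cell indices.
        anchor_wh = (p_w, p_h): anchor shape in grid-cell units."""
        t_x, t_y, t_w, t_h = t
        c_x, c_y = cell_xy
        p_w, p_h = anchor_wh

        # center: the sigmoid keeps the predicted offset inside the cell
        b_x = sigmoid(t_x) + c_x
        b_y = sigmoid(t_y) + c_y
        # shape: the exponential scales the anchor prior
        b_w = p_w * np.exp(t_w)
        b_h = p_h * np.exp(t_h)

        # convert center + shape to the two corner pairs (box_*)
        box_min = (b_x - b_w / 2.0, b_y - b_h / 2.0)
        box_max = (b_x + b_w / 2.0, b_y + b_h / 2.0)
        return (b_x, b_y, b_w, b_h), (box_min, box_max)

With all-zero outputs, decode_prediction((0., 0., 0., 0.), (8, 8), (3.5, 5.0)) gives a box centered at (8.5, 8.5) with exactly the anchor shape, i.e. the network's "do nothing" output is the anchor prior sitting at the cell center.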

Here is a graphical depiction from the original publication of YOLO v3. Note this diagram first appeared in the v2 paper from 2016.

In that v3 paper, the author writes:

During training we use sum of squared error loss. If the ground truth for some coordinate prediction is \hat{t}_* our gradient is the ground truth value (computed from the ground truth box) minus our prediction: \hat{t}_* − t_*. This ground truth value can be easily computed by inverting the equations above.

from which we infer the following ‘inverted’ equations:

logit(b_x - c_x) = t_x
logit(b_y - c_y) = t_y
log(\frac{b_w}{p_w}) = t_w
log(\frac{b_h}{p_h}) = t_h
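
Taken literally, the inversion is a one-liner per coordinate. Here is a plain-Python sketch (the function names are mine, not from the paper or any repo); the logit only makes sense when the ground-truth center actually falls inside cell (c_x, c_y), so that 0 < b_x - c_x < 1:

    import math

    def logit(p):
        # inverse of the sigmoid, valid for 0 < p < 1
        return math.log(p / (1.0 - p))

    def encode_ground_truth(b, cell_xy, anchor_wh):
        """Recover the ground-truth t-hat targets from a box
        b = (b_x, b_y, b_w, b_h), with everything in grid-cell units."""
        b_x, b_y, b_w, b_h = b
        c_x, c_y = cell_xy
        p_w, p_h = anchor_wh

        t_x = logit(b_x - c_x)   # only needed if you compare raw t values
        t_y = logit(b_y - c_y)
        t_w = math.log(b_w / p_w)
        t_h = math.log(b_h / p_h)
        return t_x, t_y, t_w, t_h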

In order to promote stability during training, the center location coordinate error is not evaluated directly. Instead, the sigmoid function is used to constrain the values, so that 0 <= \sigma(t_{x,y}) <= 1. This is important: the network outputs for the center coordinate predictions are never used directly. That means when preprocessing your labelled training data you don't need to store t_x and t_y at all, since they are never used. Instead, you store \sigma(t_{x,y}), which is equivalent to storing b_x - c_x and b_y - c_y. That is, you store the fractional part of the grid-relative location. For example, for the exact center of the center cell in a 19x19 grid the following values pertain:
b_x = 8.5
b_y = 8.5
c_x = 8. (0-based index of the center grid cell x-offset)
c_y = 8. (0-based index of the center grid cell y-offset)
\sigma(t_x) = b_x - c_x = 8.5 - 8. = 0.5
\sigma(t_y) = b_y - c_y = 8.5 - 8. = 0.5
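
In code, getting c_* and the stored \sigma(t_{x,y}) value from a grid-relative center is just a floor and a subtraction. A tiny snippet reproducing the numbers above:

    import math

    b_x, b_y = 8.5, 8.5                          # center of the middle cell of a 19x19 grid
    c_x, c_y = math.floor(b_x), math.floor(b_y)  # 8, 8
    sigma_tx, sigma_ty = b_x - c_x, b_y - c_y    # 0.5, 0.5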

If you stored the raw t_x as your ground truth, you would need to apply logit() when creating the training data and compare it to the raw network output. But since you only ever use the sigmoid-constrained value of the center location, in effect you are doing \sigma(logit(b_x - c_x)) = b_x - c_x: your ground truth for the center location is simply b_x - c_x, which is compared to \sigma(t_x) in the loss function. More on that below.

Bounding box shape is handled differently. Here the sigmoid constraint is not applied, and you do end up using t_{w,h} directly. Accordingly, when you create the ground truth data you store the log of the ratio of the ground-truth box shape to the 'best anchor', which is picked from the set of anchor boxes by IoU. That is, \hat{t}_{w,h} = log(\frac{b_{w,h}}{p_{w,h}}). That log ratio, \hat{t}_{w,h}, is compared directly to the predicted t_{w,h} in the loss function. So the coordinates term of the loss function looks like this:

    from keras import backend as K

    # center is constrained by sigmoid.
    # remember, the ground truth already has the sigmoid implicitly applied!
    truth_sigma_txy = ground_truth[..., 1:3]
    predicted_sigma_txy = K.sigmoid(predicted[..., 1:3])

    # shape is not constrained
    truth_twh = ground_truth[..., 3:5]
    predicted_twh = predicted[..., 3:5]

    # boxes still in (x, y, w, h) format, i.e. not corners
    truth_t_boxes = K.concatenate([truth_sigma_txy, truth_twh], axis=-1)
    predicted_t_boxes = K.concatenate([predicted_sigma_txy, predicted_twh], axis=-1)

    coordinates_loss = coordinates_weights * K.square(truth_t_boxes - predicted_t_boxes)
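
For completeness, here is a hedged numpy sketch of where a truth_twh value like the one above could come from during label preprocessing: compare the ground-truth shape against each anchor shape using IoU (both boxes imagined sharing the same center), pick the best one, and store the log ratio. The function name and array shapes are illustrative, not taken from preprocess_true_boxes().

    import numpy as np

    def best_anchor_and_twh(box_wh, anchors_wh):
        """box_wh: (w, h) of one ground-truth box, in grid-cell units.
        anchors_wh: (num_anchors, 2) array of anchor shapes, same units.
        Returns the index of the best anchor and the (t_w, t_h) target."""
        box_wh = np.asarray(box_wh, dtype=np.float32)
        anchors_wh = np.asarray(anchors_wh, dtype=np.float32)

        # IoU of shapes only: with a shared center, the intersection area
        # is min(w1, w2) * min(h1, h2)
        intersection = np.prod(np.minimum(box_wh, anchors_wh), axis=1)
        union = np.prod(box_wh) + np.prod(anchors_wh, axis=1) - intersection
        iou = intersection / union

        best = int(np.argmax(iou))
        t_wh = np.log(box_wh / anchors_wh[best])
        return best, t_wh

The IoU here deliberately ignores the box center, since the anchor only supplies a shape prior; the center is handled entirely by \sigma(t_{x,y}) and the grid cell index.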

Because the center location is sigmoid-constrained, and the shape is a ratio of bounding box to anchor box, you end up with more manageable gradients and values on the order of 1. If you predicted pixel coordinates directly, the values could be as large as the image dimensions, hundreds of pixels at least. Given the number of box predictions YOLO makes simultaneously in each forward pass (thousands of boxes, each with its own coordinates, objectness, and class scores), it needs every advantage you can give it.

Thoughts and suggestions welcome.

hey @jonaslalin, I solved part of the puzzle: why location and shape are treated differently in the ground truth load (preprocess_true_boxes()) and the loss function (yolo_loss()). Still gnawing on the question about 1 versus sigmoid(1) for detection.