How to interpret values of box_xy, box_wh in yolo_eval

I’m wondering how to interpret the numbers in box_xy, box_wh, with slices box_xy[0][0] and box_wh[0][0] shown below.

[[ 0.885695    1.6324562 ]
 [-1.0968447  -1.3350451 ]
 [ 1.3657959  -6.310956  ]
 [-0.54382193 -0.24732351]
 [-3.580883   -3.31002   ]], shape=(5, 2), dtype=float32)

[[ 3.2747407   7.876538  ]
 [-0.95626986  6.16789   ]
 [ 2.974455   -8.062703  ]
 [ 4.3521757   3.149329  ]
 [ 2.8536212   5.6248055 ]], shape=(5, 2), dtype=float32)

I am confused about why box_wh contains negative values, and why the box_xy values fall outside [0,1], unlike the convention used in the lecture video.

In real life, box_xy and box_wh are related to the output of the YOLO CNN as:
box_xy = K.sigmoid(feats[..., :2])
box_wh = K.exp(feats[..., 2:4])

Thus 0 <= box_xy <= 1 because of the sigmoid().
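A minimal sketch of those two lines in plain Python rather than Keras, for a single anchor box. The raw values are made up for illustration, and sigmoid() here is a hand-written helper, not the Keras function.

```python
import math

def sigmoid(t):
    """Logistic function; maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

# Hypothetical raw network outputs (t_x, t_y, t_w, t_h) for one anchor box.
raw_feats = [0.885695, -1.3350451, -0.95626986, 3.149329]

box_xy = [sigmoid(t) for t in raw_feats[:2]]    # each value lies in (0, 1)
box_wh = [math.exp(t) for t in raw_feats[2:4]]  # each value is strictly positive

print(box_xy, box_wh)
```

Whatever the sign of the raw inputs, the outputs of these two functions cannot be negative.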

box_wh is scaled relative to the anchor boxes. The relationship is:

b_w = p_w * e^{t_w} and
b_h = p_h * e^{t_h}

where p_w and p_h are the width and height of an anchor box, respectively. Inverting the first equation gives

t_w = log(\frac{b_w}{p_w})

so the raw value t_w can be < 0 (whenever b_w < p_w), even though b_w itself is always positive.
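A small round-trip check of the anchor relationship, as a sketch in plain Python. The anchor and box widths are hypothetical numbers, and encode_size() / decode_size() are illustrative helper names, not functions from the exercise code.

```python
import math

def encode_size(b, p):
    """t = log(b / p): the raw value the network would have to output."""
    return math.log(b / p)

def decode_size(t, p):
    """b = p * e^t: the box size recovered from the raw value."""
    return p * math.exp(t)

p_w = 2.0  # hypothetical anchor width
b_w = 1.0  # hypothetical box width, narrower than the anchor

t_w = encode_size(b_w, p_w)       # log(0.5), which is negative
b_w_back = decode_size(t_w, p_w)  # positive again, equal to b_w up to rounding

print(t_w, b_w_back)
```

A box narrower than its anchor gives a negative t_w, while the decoded width stays positive.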

However, notice that the first time you invoke yolo_eval() it is with dummy data, and the values don’t mean anything. That exercise is useful only to see what is going on with the shapes.

The equations are explained in the YOLO 9000 paper as:
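For reference, the box-decoding equations as I recall them from the YOLO9000 paper, written with the same symbols as above. Here \sigma is the sigmoid, and c_x, c_y are the offsets of the grid cell from the top-left corner of the image. Note that in the paper's notation b_x and b_y already include the cell offset.

```latex
\begin{aligned}
b_x &= \sigma(t_x) + c_x \\
b_y &= \sigma(t_y) + c_y \\
b_w &= p_w \, e^{t_w} \\
b_h &= p_h \, e^{t_h}
\end{aligned}
```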



In yolo_eval, the shape of the boxes being passed into yolo_non_max_suppression is (None, 4), so are the coordinates of boxes here absolute, all having the same reference point?

If so, since in real life values in box_xy[i][j] in yolo_output are relative to the top left corner of their respective grid, doesn’t this mean that the coordinates in different box_xy[i][j]'s have different reference points? If this is the case, where in yolo_eval do we adjust for this such that coordinates in boxes being passed into yolo_non_max_suppression all share a common reference point?

Yes, different coordinate systems are used in different places.

Grid-cell-relative is beneficial inside the loss function and optimization. It keeps all components of the loss - predictions for object presence/absence, bounding box center, bounding box shape, and object class - on the same scale so they can be added into a single value for the optimizer.

However, grid-cell-relative coordinates with multiple anchors per grid cell can result in multiple predictions for the same object. In order to disambiguate and prune duplicates there must be a common point of reference. This is why the grid-cell-relative b_x, b_y, b_w, b_h tuple is converted to image-relative x1, y1, x2, y2. The first step is using the helper function yolo_boxes_to_corners().

def yolo_boxes_to_corners(box_xy, box_wh):
    """Convert YOLO box predictions to bounding box corners."""
    box_mins = box_xy - (box_wh / 2.)
    box_maxes = box_xy + (box_wh / 2.)

    return K.concatenate([
        box_mins[..., 1:2],  # y_min
        box_mins[..., 0:1],  # x_min
        box_maxes[..., 1:2],  # y_max
        box_maxes[..., 0:1]  # x_max
    ])

def yolo_eval(yolo_outputs, ...):  # remaining parameters omitted in this excerpt
    """Evaluate YOLO model on given input batch and return filtered boxes."""

    box_confidence, box_xy, box_wh, box_class_probs = yolo_outputs
    boxes = yolo_boxes_to_corners(box_xy, box_wh)

which you can read in

(notice the code stores the values in y1, x1, y2, x2 order here)

The second step is done right in yolo_eval():


# Scale boxes back to original image shape.
height = image_shape[0]
width = image_shape[1]
image_dims = K.stack([height, width, height, width])
image_dims = K.reshape(image_dims, [1, 4])
boxes = boxes * image_dims

PS: Notice that there is no reason to calculate box_wh / 2. twice in yolo_boxes_to_corners(). Once you have calculated and assigned box_mins, you can just do box_maxes = box_mins + box_wh. The more you look at the underlying Python, the more of these little inefficiencies you can find. I definitely recommend a thorough inspection before using any of this open-source GitHub code in your own project.
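To illustrate the point about the repeated division, here is a sketch for a single box in plain Python. corners_single_division() is a hypothetical name, and the inputs are made-up values chosen to be exact in floating point; the real helper operates on Keras tensors.

```python
def corners_single_division(box_xy, box_wh):
    """Return (y_min, x_min, y_max, x_max) for one box, halving box_wh only once."""
    x, y = box_xy
    w, h = box_wh
    x_min, y_min = x - w / 2.0, y - h / 2.0
    x_max, y_max = x_min + w, y_min + h  # reuse the mins instead of halving again
    return (y_min, x_min, y_max, x_max)

corners = corners_single_division((0.5, 0.5), (0.25, 0.5))
print(corners)  # (0.25, 0.375, 0.75, 0.625)
```

The result matches what center +/- half-size would give, in the same y, x ordering as the original helper.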


After applying yolo_boxes_to_corners in yolo_eval, the output boxes has shape (19,19,5,4) where (for example) boxes[18][18] contains the grid-cell-relative coordinates for each anchor box relative to the bottom rightmost grid cell of the image.

However, the next step is passing boxes into yolo_filter_boxes, which returns a flattened array of shape (None, 4). At this point, aren’t all the coordinates still grid-cell relative, but with the spatial information of where the referenced grid cell is relative to the image lost because the array is flattened? If so, how does passing this into scale_boxes in the next step transform the coordinates from grid relative to image relative?

It’s hard to trace the scale conversion just by reading the code, but it looks to me that for the boxes = boxes * image_dims line to produce the expected results, boxes has to be in units of fraction of the total image. For example [.25 .25 .75 .75] * [608 608 608 608] = [152 152 456 456]

To confirm one could instrument the code and collect data during a forward prop.
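The arithmetic in that example can be checked with a few lines of plain Python. This is only a sketch of the scaling step for one box; scale_box() is a hypothetical helper, and the 608 x 608 image size is the one used in the example above.

```python
def scale_box(box, image_height, image_width):
    """Scale a (y1, x1, y2, x2) box given as fractions of the image to pixels."""
    dims = (image_height, image_width, image_height, image_width)
    return [coord * dim for coord, dim in zip(box, dims)]

pixels = scale_box([0.25, 0.25, 0.75, 0.75], 608, 608)
print(pixels)  # [152.0, 152.0, 456.0, 456.0]
```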


After further digging, it looks like the rest of the predicted bounding box scaling takes place within the helper function yolo_head(), at which time the grid information is still available.

box_xy = (box_xy + conv_index) / conv_dims
box_wh = box_wh * anchors_tensor / conv_dims

So I now believe the code used for this exercise spreads the work of producing the bounding box in image-relative coordinates across 3 places: yolo_eval(), yolo_head(), and yolo_boxes_to_corners().
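Here is how I read those two yolo_head() lines, sketched for a single box in plain Python. The function and argument names are hypothetical stand-ins for the tensors in the real code, the numbers are made up, and I am assuming the anchors are expressed in grid-cell units.

```python
def to_image_relative(offset_xy, raw_wh, cell_xy, anchor_wh, grid_wh):
    """Mirror the two yolo_head() lines for a single box.

    offset_xy: sigmoid output, position inside the grid cell, each in (0, 1)
    raw_wh:    exp output, size as a multiple of the anchor
    cell_xy:   (column, row) index of the grid cell (role of conv_index)
    anchor_wh: anchor size, assumed to be in grid-cell units (role of anchors_tensor)
    grid_wh:   number of grid cells per axis (role of conv_dims)
    """
    box_xy = [(o + c) / g for o, c, g in zip(offset_xy, cell_xy, grid_wh)]
    box_wh = [r * a / g for r, a, g in zip(raw_wh, anchor_wh, grid_wh)]
    return box_xy, box_wh

# A box centered in the bottom-right cell of a 19 x 19 grid.
box_xy, box_wh = to_image_relative(
    offset_xy=(0.5, 0.5), raw_wh=(1.0, 1.0),
    cell_xy=(18, 18), anchor_wh=(1.0, 2.0), grid_wh=(19, 19))

print(box_xy, box_wh)
```

Under these assumptions the center comes out as 18.5 / 19 of the image on each axis, so the grid-cell index is already folded into the coordinate before any flattening happens.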


If I understand correctly, the box_xy and box_wh that are part of yolo_output have actually been (partially) converted to image-relative coordinates and thus aren’t equivalent to the b_x, b_y, b_w, b_h, which are grid relative?

Just wanted to observe that this is not necessarily so. It turns out that, like many things in this YOLO implementation, there is a nuance that is easy to miss. The center point of a predicted bounding box is encoded as a combination of the offset of the point within its grid cell plus the offset of the grid cell origin from the image origin. That means the center point of grid cell (0,0) is (0.5, 0.5) but the center of grid cell (1,1) is (1.5, 1.5).

This means that the grid offset is carried with the coordinates even after it is removed from the (m,S,S,B, (1+4+C)) network output structure until such time as you want to scale back to true image pixel count coordinates for drawing the boxes.

This would mean that prior to flattening, the coordinates should be converted to be image-relative instead of grid relative?

No, this encoding means you can still convert the scale after the flattening.

Suppose a flattened box is initially encoded as [1.5 1.5 .5 .5].

This means the center of the box is at x = 1.5 ‘grid cells’, each of which is 32 pixels wide in this exercise. You can easily convert by multiplying by 32: 1.5 * 32 = 48, which is the center of the (1,1) grid cell. Similarly, the width and height are 0.5 ‘grid cells’, or 16 pixels.

x1 = bx - bw/2 or 48 - 8 = 40
y1 = by - bh/2
x2 = x1 + bw or 40 + 16 = 56
y2 = y1 + bh

box [1.5 1.5 0.5 0.5] in ‘grid cell’ and box [40 40 56 56] in ‘image pixels’ are equivalent for grid cell size of 32 pixels.
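The same conversion as a short sketch in plain Python. grid_box_to_pixel_corners() is a hypothetical helper, and the 32-pixel cell size is the one from this exercise.

```python
def grid_box_to_pixel_corners(box, cell_size):
    """Convert (bx, by, bw, bh) in grid-cell units to (x1, y1, x2, y2) in pixels."""
    bx, by, bw, bh = (v * cell_size for v in box)
    x1 = bx - bw / 2
    y1 = by - bh / 2
    return [x1, y1, x1 + bw, y1 + bh]

print(grid_box_to_pixel_corners([1.5, 1.5, 0.5, 0.5], cell_size=32))
# [40.0, 40.0, 56.0, 56.0]
```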

This is relevant because I am finding that if you convert to image-scale before doing the computations in the loss function the coordinates loss values are large and the training compromised. It seems to work better to leave the coordinates in ‘grid cell’ scale all the way through until the non-max-suppression step. Hope this helps

Sorry, I meant, is the reference point / origin for all coordinates in the (m,S,S,B, (1+4+C)) network output structure the same? For example, the reference point for the coordinate describing the center of the (1,1) grid cell (given as [1.5,1.5]) is the same as that of the (0,0) grid cell (given by [0.5,0.5])?

Also, during training, are we using this convention as opposed to the b_x, b_y, b_w, b_h described in lecture, where the spatial origin differs depending on which grid cell the point falls in (grid relative)?

b_x represents the fractional part of the offset. c_x represents the integer. For box_xy == (1.5,1.5), b_x == 0.5 and c_x == 1

Thus b_x is relative to its own grid cell origin while c_x is relative to the image.

The sum b_x + c_x, which is what is stored in box_xy[0], is relative to the image.
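As a quick sketch, the split into integer and fractional parts looks like this in plain Python. split_center() is a hypothetical helper, and the coordinate is assumed to be in grid-cell units as in the 1.5 example.

```python
import math

def split_center(coord):
    """Split an image-referenced coordinate (in grid-cell units) into (c, b)."""
    c = math.floor(coord)  # integer part: index of the grid cell
    b = coord - c          # fractional part: offset inside that cell
    return c, b

c_x, b_x = split_center(1.5)
print(c_x, b_x)  # 1 0.5
```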


So during training, we are still training against targets in b_x, b_y, b_w, b_h format, but by the time we get to yolo_eval, the provided box_xy coordinates are relative to the image (all have the same reference point)?