Yes, different coordinate systems are used in different places.
Grid-cell-relative coordinates are beneficial inside the loss function and during optimization. They keep all components of the loss - predictions for object presence/absence, bounding box center, bounding box shape, and object class - on roughly the same scale, so they can be summed into a single value for the optimizer.
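To make that concrete, here is a toy sketch - not the actual YOLO loss, which also weights its terms and sums over grid cells and anchors - showing how components that all live in roughly the [0, 1] range can be summed into one scalar:

import numpy as np

def toy_detection_loss(pred_xy, true_xy, pred_wh, true_wh,
                       pred_obj, true_obj, pred_cls, true_cls):
    """Toy illustration: every term is on a comparable scale."""
    coord_loss = np.sum((pred_xy - true_xy) ** 2)  # box center error
    shape_loss = np.sum((pred_wh - true_wh) ** 2)  # box width/height error
    obj_loss = (pred_obj - true_obj) ** 2          # presence/absence error
    cls_loss = np.sum((pred_cls - true_cls) ** 2)  # class probability error
    # Because the inputs are all grid-cell-relative values or
    # probabilities, a plain sum of the terms is meaningful.
    return coord_loss + shape_loss + obj_loss + cls_loss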
However, with multiple anchors per grid cell, grid-cell-relative coordinates can produce multiple predictions for the same object. To disambiguate and prune duplicates, the boxes need a common frame of reference. This is why the grid-cell-relative (b_x, b_y, b_w, b_h) tuple is converted to image-relative (x1, y1, x2, y2). The first step is the helper function yolo_boxes_to_corners().
from keras import backend as K

def yolo_boxes_to_corners(box_xy, box_wh):
    """Convert YOLO box predictions to bounding box corners."""
    box_mins = box_xy - (box_wh / 2.)
    box_maxes = box_xy + (box_wh / 2.)
    return K.concatenate([
        box_mins[..., 1:2],   # y_min
        box_mins[..., 0:1],   # x_min
        box_maxes[..., 1:2],  # y_max
        box_maxes[..., 0:1]   # x_max
    ])
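To make the arithmetic concrete, here is the same center/size-to-corners conversion on one made-up box, in plain NumPy standing in for the Keras backend:

import numpy as np

# A hypothetical box: center at (0.5, 0.5), width 0.4, height 0.2
# (x first, y second, matching box_xy and box_wh above).
box_xy = np.array([0.5, 0.5])
box_wh = np.array([0.4, 0.2])

box_mins = box_xy - box_wh / 2.   # [0.3, 0.4]
box_maxes = box_xy + box_wh / 2.  # [0.7, 0.6]

# Reordered to y_min, x_min, y_max, x_max, as in yolo_boxes_to_corners():
corners = np.array([box_mins[1], box_mins[0], box_maxes[1], box_maxes[0]])
print(corners)  # [0.4 0.3 0.6 0.7]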
def yolo_eval(yolo_outputs,
              image_shape,
              max_boxes=10,
              score_threshold=.6,
              iou_threshold=.5):
    """Evaluate YOLO model on given input batch and return filtered boxes."""
    box_confidence, box_xy, box_wh, box_class_probs = yolo_outputs
    boxes = yolo_boxes_to_corners(box_xy, box_wh)
Both functions are in keras_yolo.py, where you can read them in full. (Notice that the code stores the values in y1, x1, y2, x2 order here.)
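The y-first ordering is what TensorFlow's non-max-suppression op expects: tf.image.non_max_suppression documents its boxes argument as [y1, x1, y2, x2], and yolo_eval() uses that op to prune the duplicates. A minimal standalone sketch with made-up boxes (not code from keras_yolo.py):

import tensorflow as tf

# Two heavily overlapping boxes in [y1, x1, y2, x2] order, plus scores.
boxes = tf.constant([[0.40, 0.30, 0.60, 0.70],
                     [0.41, 0.31, 0.61, 0.71]])
scores = tf.constant([0.9, 0.8])

# Keeps the highest-scoring box and drops near-duplicates whose
# overlap exceeds the IoU threshold; returns the surviving indices.
keep = tf.image.non_max_suppression(boxes, scores,
                                    max_output_size=10,
                                    iou_threshold=0.5)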
The second step happens directly in yolo_eval():
    ...
    # Scale boxes back to original image shape.
    height = image_shape[0]
    width = image_shape[1]
    image_dims = K.stack([height, width, height, width])
    image_dims = K.reshape(image_dims, [1, 4])
    boxes = boxes * image_dims
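With made-up numbers, for a hypothetical 720x1280 image, the earlier normalized box scales back to pixel coordinates like this:

import numpy as np

height, width = 720., 1280.            # hypothetical original image shape
image_dims = np.array([height, width, height, width])

box = np.array([0.4, 0.3, 0.6, 0.7])   # normalized y1, x1, y2, x2
print(box * image_dims)                # [288. 384. 432. 896.]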