Question about line in yolo loss function

no_objects_loss = no_object_weights * K.square(- pred_confidence)

Anyone have a good explanation for why one would take the negative of values you’re about to square? Is it just me?

Can you give a reference to where you saw this?
Is it from a lecture, if so please give the title and a time mark.


The same code is in the original tf/keras port of Darknet circa 2017. In two cases it is clearly the square of a difference, so maybe the third one is just Python shorthand for (0 - pred_confidence)? It seems like an unneeded computation.

I agree, that seems unnecessary.

I reformatted the code a little and ended up with this:

no_object_loss  = no_object_weights  * K.square( - predicted_presence)
object_loss     = has_object_weights * K.square(1 - predicted_presence)
confidence_loss = object_loss + no_object_loss

Here’s what I think is going on. predicted_presence represents the YOLO CNN output predicting whether an object is detected or not. I was confused that the loss function wasn’t explicitly comparing to ground truth in this step, as it does for location and classification. However, notice that the ground truth for “has object” is a 1, while ground truth for “has no object” is 0. I think the first line above is equivalent to:

no_object_loss  = no_object_weights  * K.square(true_presence - predicted_presence)

but only for the cells where there is no object in the ground truth, i.e. true_presence = 0. For these cells, the squared error is (0 - predicted_presence)^2

Similarly, the second line can be thought of as:

object_loss  = object_weights  * K.square(true_presence - predicted_presence)

for the locations where true_presence = 1. Or:

object_loss  = object_weights  * K.square(1 - predicted_presence)
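If it helps, here is a tiny NumPy sketch (the array values are made up purely for illustration, and the weights are omitted) showing that the two masked terms, taken together, are exactly the single squared error against ground truth:

```python
import numpy as np

# Hypothetical confidence predictions (sigmoid outputs in [0, 1]) for four cells
predicted_presence = np.array([0.9, 0.2, 0.7, 0.1])
# Ground truth: 1 where a cell contains an object, 0 where it does not
true_presence = np.array([1.0, 0.0, 1.0, 0.0])

object_mask = true_presence            # 1 at "has object" cells
no_object_mask = 1.0 - true_presence   # 1 at "no object" cells

# Split form, as in the toolkit code
object_loss = object_mask * np.square(1.0 - predicted_presence)
no_object_loss = no_object_mask * np.square(0.0 - predicted_presence)
split_total = object_loss + no_object_loss

# Single squared-error form against ground truth
direct_total = np.square(true_presence - predicted_presence)

assert np.allclose(split_total, direct_total)
```

So the split into two masked terms and the direct squared error against true_presence are the same computation.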

EDIT: removing the line where I said I think the code is correct. I am no longer confident in that.


So I spent way more time looking at this one line of code than is healthy, and am back to thinking there is a problem here. In an attempt to validate my implementation, as well as isolate the source of any loss, I ran the yolo_loss() function with identical values for both the ground truth and the prediction. The exercise was helpful in removing the last hiccups in my code, but I wasn’t getting the expected output from confidence_loss, and this line of code is the reason.

Here are the equivalent lines from the YAD2K toolkit code provided with the class exercise:

    no_objects_loss = no_object_weights * K.square(-pred_confidence)
    objects_loss = (object_scale * detectors_mask * K.square(1 - pred_confidence))
    confidence_loss = objects_loss + no_objects_loss

Here is where values are assigned to pred_confidence inside yolo_head():

    box_confidence = K.sigmoid(feats[..., 4:5])

Notice the use of sigmoid(). If the raw output of the YOLO CNN is 1, box_confidence is sigmoid(1). Let’s say for the sake of argument that this is a location that does indeed have an object in it, so the prediction is a ‘correct’ one. Here’s what happens inside the loss function. (NOTE: box_confidence is returned from yolo_head() and assigned to pred_confidence inside yolo_loss(), though there is a(nother) bug in the code here: the returned objects are in the wrong order.) So for this thought experiment, the distance being computed in the subtraction is (1 - sigmoid(1)):

objects_loss = (object_scale * detectors_mask * K.square(1 - sigmoid(1)))

object_scale is a non-zero constant. We’re asserting that detectors_mask has a 1 in this cell since it is a ‘correct’ prediction in terms of object presence. So, spoiler alert, objects_loss ain’t 0 either.

Even if you have a 100% perfect prediction, you still get non-zero objects_loss, a non-zero confidence_loss, and non-zero total loss returned to the optimizer by the yolo_loss() function. I’m struggling to understand how that could be correct. Am I missing something here?

ps: for the record, sigmoid(1) is 0.731058 and sigmoid(0) is 0.5
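To make the thought experiment concrete, here is a minimal NumPy sketch. The object_scale value is illustrative only, not necessarily the constant the toolkit uses:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Raw network output t_o for a cell that truly contains an object.
# We feed the SAME value in as both truth and prediction (a "perfect" prediction).
t_o = 1.0
pred_confidence = sigmoid(t_o)   # ~0.731058, as computed in yolo_head()
detectors_mask = 1.0             # ground truth: object present in this cell
object_scale = 5.0               # illustrative constant, assumed for this sketch

objects_loss = object_scale * detectors_mask * np.square(1.0 - pred_confidence)
print(objects_loss)  # non-zero, even though prediction and truth matched exactly
```

Because sigmoid(1) is about 0.731, the squared term (1 - sigmoid(1))^2 is about 0.072, so the loss can never reach zero this way.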


and here’s the bug related to yolo_head() return objects:

pred_xy, pred_wh, pred_confidence, pred_class_prob = yolo_head(...)

def yolo_head(feats, anchors, num_classes):
    """Convert final layer features to bounding box parameters."""
    ...
    return box_confidence, box_xy, box_wh, box_class_probs

The caller unpacks (xy, wh, confidence, class_prob), but the function returns (confidence, xy, wh, class_probs). This code could never have been used as-is.

@ai_curious, did you happen to find out what the problem was?

@jonaslalin honestly I think I need to step away for a while until I can see this clearly. In the YOLO v3 paper, J. Redmon reiterates the prediction formulas as:

b_x = \sigma(t_x) + c_x
b_y = \sigma(t_y) + c_y
b_w = p_w e^{t_w}
b_h = p_h e^{t_h}
Pr(object) * IOU(b, object) = \sigma(t_o)

and then states:

…ground truth … can be easily obtained by inverting the equations above

So I would expect ground truth to look like this:
t_x = logit(b_x - c_x)
t_y = logit(b_y - c_y)
t_w = log(\frac{b_w}{p_w})
t_h = log(\frac{b_h}{p_h})
t_o = logit(Pr(object) * IOU(b, object))
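A quick NumPy sketch (with made-up values for the raw outputs, grid offset, and anchor size) confirms that the logit/log forms really do invert the prediction equations:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logit(p):
    # Inverse of the sigmoid: logit(sigmoid(t)) == t
    return np.log(p / (1.0 - p))

# Hypothetical raw predictions, grid cell offset, and anchor width
t_x, t_w = 0.3, -0.5
c_x, p_w = 2.0, 1.5

# Forward (prediction) equations from the paper
b_x = sigmoid(t_x) + c_x
b_w = p_w * np.exp(t_w)

# Inverted (ground truth) equations
t_x_recovered = logit(b_x - c_x)
t_w_recovered = np.log(b_w / p_w)

assert np.isclose(t_x_recovered, t_x)
assert np.isclose(t_w_recovered, t_w)
```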

But that isn’t what I find in the toolkit code, which looks like this:

    detectors_mask[i, j, best_anchor] = 1
    adjusted_box = np.array(
        [box[0] - j, box[1] - i,
         np.log(box[2] / anchors[best_anchor][0]),
         np.log(box[3] / anchors[best_anchor][1]), box_class])
    matching_true_boxes[i, j, best_anchor] = adjusted_box

Only the (w,h) components match the ‘inverted’ prediction equations. I am struggling to understand why. Further, in the yolo_loss function, the computations for object and classification loss are done directly on the predicted values, e.g. t_w and t_h, while the coordinates loss uses e^{t_w} and e^{t_h}.

Object presence always uses \sigma(t_o) in the loss function, so how can that be compared with the value in detectors_mask, which is 1? That will never yield a 0 loss, even if you assign the same data for both truth and prediction. Shouldn’t feeding identical data for both into the loss function generate 0 loss? It doesn’t, as written in the first code fragment above.

The papers are terse and the code is cryptic at best, fully undocumented at worst. Whatever time you initially think it will take to implement your own YOLO, budget for 10x :wink:

Lesson learned :wink::sweat_smile: Unfortunately, I haven’t had the opportunity to dig into the details of the yolo implementation myself. Hopefully someone else here can assist.

Otherwise, one way might be to open issues in their GitHub repos and request additional documentation/clarification for the implementation steps. Surely more people have the same questions. Have you tried that, @ai_curious?

I stared at and executed the yolo loss function locally enough times that I think I finally resolved my confusion over \sigma(t_o) versus t_o. The explanation is the same as the one for the location center coordinates t_x and t_y.

Object presence t_o is predicted such that 0 <= \sigma(t_o) <= 1. To derive the ground truth value \hat{t}_o you could invert that and say \hat{t}_o = logit(\sigma(t_o)). But it turns out you don’t actually ever need to use t_o directly. You only need \sigma(t_o), which we are defining to be the object presence prediction (or confidence; I think both terms are used in the various YOLO papers).

For ‘correct’ locations, we want to compute the square error between the truth value, 1.0 , and the predicted value \sigma(t_o) and for the ‘incorrect’ locations compute the square error between the truth value, 0.0 , and the predicted value \sigma(t_o). In code this looks like:

    object_loss    = … K.square(1 - predicted_presence)
    no_object_loss = … K.square(0 - predicted_presence)

So really the only question is why bother to subtract predicted_presence from 0 in the no_objects_loss term since we square it anyway.
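For what it’s worth, the negation really is a no-op once you square: K.square(-x) and K.square(x) are element-wise identical, so the subtraction from 0 is purely cosmetic. A quick NumPy check (standing in for the Keras backend ops):

```python
import numpy as np

# Arbitrary confidence values in [0, 1], just for the check
pred_confidence = np.array([0.1, 0.5, 0.9])

# Squaring discards the sign: (0 - x)^2 == (-x)^2 == x^2
assert np.allclose(np.square(-pred_confidence), np.square(pred_confidence))
```

So the negation appears to be a readability choice, keeping the term visually parallel to K.square(1 - pred_confidence), rather than a necessary computation.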