Defining a custom class for YOLO loss

A fellow learner reached out and asked me about my YOLO implementation. Some of the code is a mess right now because I was trying to break it apart so that I could train classification separately from localization (as suggested in one of the YOLO papers). The classification layers are a subset of the network architecture and produce a different output shape; localization is then trained using transfer learning techniques, again with a different output shape. Those are pretty significant changes to the data structures and code. The approach I used for defining a custom class for the complicated YOLO loss function remains unchanged, though, and might be interesting, so here it is.

The first step is to define a Python class that extends, or inherits from, tensorflow.keras.losses.Loss. It requires an __init__ method and a call method. I have omitted some computation and housekeeping details of the call method. Hopefully what remains matches your understanding of what the YOLO loss function is designed to do. [You might need to scroll the code window to see it all]

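from tensorflow.keras import losses
from tensorflow.keras import backend as K

class Yolo_Loss(losses.Loss):
    def __init__(self,true_object_locations_mask,matching_true_boxes, anchors, batch_size, name="yolo_loss"):
        super().__init__(name=name)  #base class sets up the loss name and reduction
        self.true_object_locations_mask = true_object_locations_mask
        self.matching_true_boxes = matching_true_boxes
        self.anchors = anchors
        self.batch_size = batch_size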

    def call(self, truth, predicted):
        #extract ground truth values
        truth_txy = truth[...,1:3]  #Ground Truth centers - use the sigmoid for coordinates loss!
        truth_twh = truth[...,3:5]  #Ground Truth shape
        truth_class_probs = K.softmax(truth[...,5:])  #Ground Truth class(es)
        #extract predicted values from YOLO output object
        predicted_to = predicted[...,0:1]   #predicts object is there or not - use sigmoid for confidence loss!
        predicted_txy = predicted[...,1:3]  #predicts centers - use the sigmoid for coordinates loss!
        predicted_twh = predicted[...,3:5]  #predicts shapes - direct prediction
        predicted_class_probs = K.cast(K.softmax(predicted[...,5:]),'float64')  #predicts class(es)


        #confidence loss - the sigmoid turns the raw objectness output into a 0..1 presence score
        predicted_presence = K.sigmoid(predicted_to)
        # (0. - predicted_presence) is the error when there is NOT an object in GT
        # (1. - predicted_presence) is the error when there IS an object in GT
        no_objects_loss = no_object_weights  * K.cast(K.square(0. - predicted_presence),'float64')
        objects_loss    = has_object_weights * K.cast(K.square(1. - predicted_presence),'float64')
        confidence_loss = objects_loss + no_objects_loss

        #classification loss for matching detections
        matching_classes = K.cast(matching_true_boxes_batch[...,4:5],'float64')  #GT class
        classification_weights = CLASS_LAMBDA * true_object_locations_mask_batch
        classification_loss = classification_weights * K.cast(K.square(matching_classes - predicted_class_probs),'float64')


        #coordinates loss is only computed for true object locations
        coordinates_weights = COORDINATES_LAMBDA * true_object_locations_mask_batch
        coordinates_loss = coordinates_weights * K.cast(K.square(truth_t_boxes - predicted_t_boxes),'float64')


        total_loss = 0.5 * (confidence_loss_sum + coordinates_loss_sum + classification_loss_sum)
        return total_loss
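
To see how the three components fit together without the TensorFlow plumbing, here is a minimal NumPy sketch. It is not the implementation from this post. It assumes the last axis is laid out as (object flag, x, y, w, h, one-hot classes), and the function name `yolo_loss_sketch` and the default lambda values are placeholders chosen for illustration.

```python
import numpy as np

def yolo_loss_sketch(truth, predicted, coord_lambda=5.0, noobj_lambda=0.5, class_lambda=1.0):
    """Sum-squared-error loss on arrays whose last axis is (obj, x, y, w, h, classes...)."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    # 1 where the ground truth has an object, 0 elsewhere
    mask = (truth[..., 0:1] != 0).astype(np.float64)

    # Confidence: push presence toward 1 at object cells and toward 0 everywhere else
    presence = sigmoid(predicted[..., 0:1])
    objects_loss = mask * np.square(1.0 - presence)
    no_objects_loss = noobj_lambda * (1.0 - mask) * np.square(0.0 - presence)

    # Coordinates: centers go through a sigmoid, shapes are used directly (assumed here)
    pred_boxes = np.concatenate([sigmoid(predicted[..., 1:3]), predicted[..., 3:5]], axis=-1)
    coordinates_loss = coord_lambda * mask * np.square(truth[..., 1:5] - pred_boxes)

    # Classification: softmax over the class logits, compared to one-hot ground truth
    logits = predicted[..., 5:]
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    class_probs = exp / exp.sum(axis=-1, keepdims=True)
    classification_loss = class_lambda * mask * np.square(truth[..., 5:] - class_probs)

    return 0.5 * (objects_loss.sum() + no_objects_loss.sum()
                  + coordinates_loss.sum() + classification_loss.sum())
```

The mask is what keeps the coordinates and classification terms from penalizing cells that contain no object; only the confidence term looks at every cell.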

I omitted some of the matrix housekeeping fluff. Once you have the class defined, you can instantiate it as part of the model definition process:

#define custom loss function pointer for model
custom_loss_fn = Yolo_Loss(true_object_locations_mask, matching_true_boxes, use_anchors, TRAINING_BATCH_SIZE)

model = yolov2_full_detection()

#define optimizer per YOLO9000 paper
#   We train the network ... for 160 epochs using stochastic gradient descent
#   with a starting learning rate of 0.1, polynomial rate decay
#   with a power of 4, weight decay of 0.0005 and momentum of 0.9
opt = tfa.optimizers.SGDW(learning_rate=0.1, momentum=0.9, weight_decay=0.0005)

#compile model
model.compile(optimizer=opt, loss=custom_loss_fn, run_eagerly=True)
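
The optimizer call above uses a fixed learning_rate=0.1, while the quoted passage describes polynomial decay with a power of 4. A minimal pure-Python sketch of that schedule follows. The function name `polynomial_decay` and the choice to decay all the way to zero are assumptions for illustration. In Keras, `tf.keras.optimizers.schedules.PolynomialDecay` accepts a `power` argument and could play the same role.

```python
def polynomial_decay(step, total_steps, initial_lr=0.1, end_lr=0.0, power=4.0):
    """Learning rate after `step` of `total_steps`, decayed polynomially toward end_lr."""
    step = min(step, total_steps)               # hold at end_lr once the schedule is done
    fraction_left = 1.0 - step / total_steps    # goes from 1.0 down to 0.0
    return (initial_lr - end_lr) * fraction_left ** power + end_lr

# Example: learning rate at a few points of a 160-epoch run
schedule = [polynomial_decay(epoch, 160) for epoch in (0, 40, 80, 160)]
```

With a power of 4 the rate drops quickly early on: halfway through training it is already down to 1/16 of the starting value.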

Afterwards you train and run it just like any other model:

#train model
history = model.fit(…,
                    batch_size = TRAINING_BATCH_SIZE,
                    …)

def predict(filename, model):
    training_images = np.zeros((1, 416, 416, 3), dtype=float)

    image =
    training_images[0] =  np.asarray(image) / 255.

    return model.predict(training_images)

   # run model
predictions = predict(filename, model)

@big-bbox hope this helps. @ai2ys not sure if this exactly addresses your question, but maybe food for thought.

Thanks a lot @ai_curious! It does!

Just 3 additional questions:

  1. Is “yolov2_full_detection()” the yolo_body, or yolo_body + yolo_head?

  2. How exactly did you make y_train, and what dimension is it?
    For one class, I expect it to be (m,19,19,5,6), with m the batch size, 5 bounding boxes, and 1 class.
    If you run the provided “preprocess_true_boxes()” function, you get:
    detectors_mask : (m,19,19,5,1)
    matching_true_boxes : (m,19,19,5,5)
    Did you concatenate these two together to form y_train?

  3. How did you provide the values for “use_anchors” (in your loss function)?
    For 5 bounding boxes, is it a numpy array of shape (5,2) with float values?

It is equivalent to yolo_body(). It’s the layers of the model itself, not the post-processing of the model output. Note that in the darknet code used in the Autonomous Driving exercise, the output of the model is contained in the variable feats, short for features. The final layer of activations is applied in the utility function yolo_head, in this section:

    box_confidence = K.sigmoid(feats[..., 4:5])
    box_xy = K.sigmoid(feats[..., :2])
    box_wh = K.exp(feats[..., 2:4])
    box_class_probs = K.softmax(feats[..., 5:])

I do that inside my loss function instead.
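
For anyone who wants to poke at those four activations outside of Keras, here is a minimal NumPy sketch. It assumes the same last-axis layout as the yolo_head snippet above (x, y, w, h, confidence, classes), which differs from the ordering used in the loss function earlier in the thread, and the function name `decode_raw_output` is made up for illustration. The anchor scaling that yolo_head also performs is left out.

```python
import numpy as np

def decode_raw_output(feats):
    """Apply sigmoid / exp / softmax to raw features laid out as (x, y, w, h, conf, classes...)."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    box_xy = sigmoid(feats[..., :2])            # centers squashed into (0, 1) within a cell
    box_wh = np.exp(feats[..., 2:4])            # shapes are strictly positive scale factors
    box_confidence = sigmoid(feats[..., 4:5])   # objectness score in (0, 1)

    # Numerically stable softmax over the class logits
    logits = feats[..., 5:]
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    box_class_probs = exp / exp.sum(axis=-1, keepdims=True)

    return box_xy, box_wh, box_confidence, box_class_probs

# Example: an all-zero raw output for a 19x19 grid, 5 anchors, 3 classes
raw = np.zeros((1, 19, 19, 5, 8))
xy, wh, conf, probs = decode_raw_output(raw)
```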

My y_train came from Berkeley Deep Drive data that I preprocessed myself to get into YOLO input format. I used 8 anchor boxes, so it was (m,19,19,8,6).

detector_mask and matching_true_boxes are then subsets of y_train, not vice versa.

Covered in this thread… Deriving YOLO anchor boxes

You answered some questions I even forgot to ask :joy: Thanks! You’re the YOLO expert of this specialization.

I’ve spent two hours reading what you said, looking at the code from the “yad2k” folder and the YOLOv2 implementation from “experiencor” … no wonder they didn’t ask us to implement it in the assignments, it’s a mess.

I understand what you say, but in their implementation they pass it as an input so they can reuse it in the loss function: model.fit([image_data, boxes, detectors_mask, matching_true_boxes], np.zeros(len(image_data)), validation_split=validation_split, batch_size=32, epochs=5, callbacks=[logging])
That’s why I’m very confused about their whole script. Your implementation looks a lot better than theirs.

So do I…

y_train = load_cropped_image_boxes(i,TRAINING_SET_SIZE,image_names,19,19,use_anchors)

true_object_locations_mask = K.cast((y_train[0:1,:,:,:,0:1] != 0.), 'float64')
matching_true_boxes = K.cast(y_train[0:1,:,:,:,1:6],'float64')  #includes class designation

custom_loss_fn = Yolo_Loss(true_object_locations_mask, matching_true_boxes, use_anchors)

class Yolo_Loss(losses.Loss):
    def __init__(self,true_object_locations_mask,matching_true_boxes, anchors, name="yolo_loss"):
        super().__init__(name=name)  #base class sets up the loss name and reduction
        self.true_object_locations_mask = true_object_locations_mask
        self.matching_true_boxes = matching_true_boxes

  1. Load y_train labels corresponding to the training batch
  2. Use Python slicing to extract the true objects mask (locations that have objects == 1, other locations == 0)
  3. Use Python slicing to extract the ground truth bounding boxes and correct class designation for locations where true objects == 1
  4. Pass them into the loss function constructor and cache for use during loss function execution.
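
Steps 2 and 3 can be tried out on a toy array with plain NumPy. This is a minimal sketch rather than the actual preprocessing: the label values are invented, `split_labels` is a hypothetical helper name, and the last axis is assumed to be (object flag, x, y, w, h, class id) as in the slices above.

```python
import numpy as np

# Toy labels: 2 images, 19x19 grid, 8 anchors, 6 values per anchor
m, grid, n_anchors = 2, 19, 8
y_train = np.zeros((m, grid, grid, n_anchors, 6))
y_train[0, 9, 9, 3] = [1.0, 0.5, 0.5, 0.2, 0.3, 2.0]  # one object in the first image

def split_labels(y_batch):
    """Return (mask, boxes): mask is 1 at object locations, boxes holds x, y, w, h, class."""
    mask = (y_batch[..., 0:1] != 0.0).astype(np.float64)
    boxes = y_batch[..., 1:6].astype(np.float64)
    return mask, boxes

# Same slicing as above: the first image only, batch axis kept
true_object_locations_mask, matching_true_boxes = split_labels(y_train[0:1])
```

Using `0:1` rather than `0` for both the batch slice and the mask slice keeps those axes in place, so the mask broadcasts cleanly against the per-anchor loss terms.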

Note that if this seems messy now, it gets even messier when you start working with a data set larger than fits in memory. The masks are no longer static.

If you read all the way through my thread on lessons learned, you see quotes about all the steps they went through in training. Separating the classification layers from the localization layers. Training the classification layers first on smaller images with lots of augmentation. Then applying transfer learning to train the localization layers on larger images. Dynamically adjusting learning rate after intervals of epochs…

Getting YOLO to work completely from scratch is non-trivial, which is why you weren’t asked to do it in this class, and why almost no paper or blog you find on the web does it either. They almost all say ‘we started with a trained model…’

And the versions of YOLO that followed the one we use in the class, v3, v4, v5, are even more complex architectures. Not for the faint of heart…Good luck

Did you use the tf.data API to load the data? I think it could maybe help prevent this.

I know, and I totally agree with you. I gave up and implemented a different architecture. The future is Anchor free :joy:

One last question: what metrics did you use to measure your object detection model? Did you find a package / library online that did it? I tried this GitHub repo but their code is buggy somehow. I’d rather avoid having to define the IoU, precision, recall, precision-recall curve, and Average Precision myself…
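
Of the metrics listed, IoU is the building block the others rest on, and it is short enough to write by hand. A minimal sketch, assuming boxes are given as (x1, y1, x2, y2) corner coordinates with x2 > x1 and y2 > y1 (YOLO's center/width/height boxes would need converting first):

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2) corners."""
    # Corners of the overlapping rectangle, if there is one
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

# Two 2x2 boxes overlapping in a 1x1 square
overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
```

Precision and recall then come from counting a prediction as a true positive when its IoU with a ground truth box exceeds a chosen threshold.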

I didn’t try this, but it is definitely the direction I was headed. I started out with a very small data set and my own custom data loader, then refactored to allow multiple small sets. I stopped working on it when I concluded I would need much more data and computational power than my laptop could accommodate. Notice that if you are streaming data off the disk, the trick of preprocessing the true box masks and caching them in the custom loss instance stops working. You need a method for refreshing the masks each time you load a batch, which I didn’t get around to writing.
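
One possible shape for that missing refresh step, sketched with NumPy and untested against a real training loop: rebuild the mask and boxes from each chunk of labels as it is loaded, instead of caching them once. The generator name `stream_batches` and the toy chunks are hypothetical. The same slicing could equally live inside the loss's call method, since the labels arrive there as the first argument.

```python
import numpy as np

def stream_batches(label_chunks):
    """Yield (labels, mask, boxes) per chunk, rebuilding the mask every time."""
    for y_batch in label_chunks:
        mask = (y_batch[..., 0:1] != 0.0).astype(np.float64)   # 1 where an object exists
        boxes = y_batch[..., 1:6].astype(np.float64)           # x, y, w, h, class id
        yield y_batch, mask, boxes

# Two tiny stand-in chunks, as if read from disk one after the other
chunk_a = np.zeros((1, 19, 19, 8, 6))
chunk_a[0, 2, 3, 1] = [1.0, 0.5, 0.5, 0.1, 0.1, 0.0]
chunk_b = np.zeros((1, 19, 19, 8, 6))
chunk_b[0, 7, 7, 4] = [1.0, 0.4, 0.6, 0.3, 0.2, 1.0]

results = list(stream_batches([chunk_a, chunk_b]))
```

Each chunk gets its own mask, so nothing stale carries over from one load to the next.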

Similar to the class exercise, at test time I displayed the predicted bounding boxes on the test image, and they were so bad it wasn’t worth computing any metrics on them. During training time I instrumented the loss function using TensorBoard and tracked separately all the many components of total loss. That’s how I knew it was the center coordinates prediction that was my problem (shape actually wasn’t so bad).