Although I have submitted the assignments, I am still not clear about the algorithm’s type. The output includes:

out_scores, out_boxes, out_classes

out_boxes are real-valued coordinates and out_classes are classification labels. So is this both a regression and a classification algorithm? How is the loss calculated? This seems very different from what we have learnt so far.

This is because a NN can do both regression and classification, i.e. wx+b can be used for linear regression and g(wx+b) can be used for classification (g is an activation).
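A tiny sketch of that point, using made-up weights and inputs (the names `linear` and `sigmoid` and all the numbers are just for illustration): the same wx+b gives a real-valued output, and wrapping it in an activation g turns it into something you can read as a probability.

```python
import math

def linear(w, x, b):
    """Plain wx + b: an unbounded real value, usable directly for regression."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def sigmoid(z):
    """One possible activation g: squashes any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Made-up weights, input, and bias, purely illustrative.
w, x, b = [0.5, -1.0], [2.0, 0.5], 0.25

z = linear(w, x, b)  # regression-style output
p = sigmoid(z)       # classification-style output, read as a probability
print(z, p)
```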

We reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities. Using our system, you only look once (YOLO) at an image to predict what objects are present and where they are.

It seems different because it was different, which is why it was such a big deal when it was published in 2015/2016. The network outputs a couple hundred thousand floating point numbers (the exact number depends on the grid cell size and the number of anchor boxes). So that part is entirely regression. Post-CNN processing imputes meaning, e.g. these two numbers mean center location, these two numbers mean bounding box shape, this number is the index into the class list, etc.
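To make the "couple hundred thousand" concrete, here is a rough count. The 19x19 grid, 5 anchor boxes, and 80 classes are my assumption of typical V2 settings, not something stated above. The `split_box` helper is hypothetical, and the order of the values inside a box is also an assumption; it only shows the idea that meaning is attached to slices of the output after the network has run.

```python
# Assumed settings: 19x19 grid, 5 anchor boxes per cell, 80 classes.
GRID, ANCHORS, CLASSES = 19, 5, 80
VALUES_PER_BOX = 5 + CLASSES  # x, y, w, h, object presence, then class scores

total_outputs = GRID * GRID * ANCHORS * VALUES_PER_BOX

def split_box(values):
    """Attach meaning to one box's slice of the raw output (assumed ordering)."""
    return {
        "center": values[0:2],
        "shape": values[2:4],
        "presence": values[4],
        "class_scores": values[5:],
    }

print(total_outputs)
```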

The loss function has evolved a little over time. The one you most commonly find is the picture of the loss function from the V1 paper.
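Transcribed into LaTeX, that loss is roughly the following. This is typed from the formula as it is usually reproduced, so treat it as a sketch and check it against the paper itself. The indicator \mathbb{1}_{ij}^{obj} is 1 when box j in cell i is responsible for an object (\mathbb{1}_{ij}^{noobj} is its complement), and \lambda_{coord} and \lambda_{noobj} are the weighting factors.

$$
\begin{aligned}
\mathcal{L} ={}& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left( C_i - \hat{C}_i \right)^2 \\
&+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
$$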

x and y are the object center coordinates, w and h are the bounding box shape, C is the object presence (confidence) score, and p(c) is the class probability. The separate error terms (location, shape, presence, and type) can be weighted differently. S x S is the grid cell count and B is the bounding box count per cell. Non-trivial.

Interesting! Sorry, I should just go read the paper, but I’ll be lazy and take advantage of your expertise:

Given that the loss is a (weighted) sum of all the losses on the different parts of the predictions, it seems like they could just as easily have treated the object classification term in the usual “softmax” way and used cross entropy loss for that term. The gradients are nice and clean and you’re just adding them up in any case. Did they comment in the paper on whether they considered that and why they went the way that they did? Of course squared error is a lot cheaper to compute than logarithms, but in training it’s the gradients you’re computing not the actual cost values. And of course the derivative of log is a lot cheaper to compute than the logarithm itself.

Just curious. Obviously the method they chose worked out well, so “the pudding tastes great” is a perfectly valid answer.

It’s always risky to write ‘this is how YOLO does it’ because YOLO changed over time. The V1 paper offers only “We optimize for sum-squared error in our output of our model … because it is easy to optimize …”. For V3, however, while location and shape still use sum of squared error loss, it switches to binary cross-entropy loss for the class predictions. If that isn’t complex enough, V3 also supports multi-label classification (e.g. German Shepherd and Dog) and predicts bounding boxes at 3 different scales.
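A minimal sketch of the two options for the class term, in plain Python. The class list, the prediction values, and the choice to sum rather than average are mine, picked only to show the mechanics. It is not code from any YOLO version.

```python
import math

def sum_squared_error(truth, pred):
    """V1-style class term: squared difference per class, summed."""
    return sum((t - p) ** 2 for t, p in zip(truth, pred))

def binary_cross_entropy(truth, pred, eps=1e-12):
    """V3-style class term: every class is its own yes/no question,
    so more than one label can be 1 at the same time."""
    total = 0.0
    for t, p in zip(truth, pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total

# Hypothetical multi-label target: [german_shepherd, dog, cat]
truth = [1.0, 1.0, 0.0]
pred = [0.8, 0.9, 0.1]

sse = sum_squared_error(truth, pred)
bce = binary_cross_entropy(truth, pred)
print(sse, bce)
```

Note that a softmax over the classes would force the predictions to sum to 1, which does not fit a target with two 1s in it. Independent per-class sigmoids with binary cross-entropy do.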

V4, the last version from the main YOLO family tree, introduces many more architectural complexities, and the branch from the PyTorch / Ultralytics groups, which some people argue shouldn’t even be named YOLO, differs even more. For purposes of this forum, I try to stick with V1 or V2, which is what the lectures and self-driving car lab were originally based on.

So in V1 the whole thing is treated as a regression problem, including the object classification part. In the formula, p(c) is the probability that label c is present. What about the upper case C_i?

I am also curious about C: are the values all just ‘1’ or ‘0’? For example, C_1 - C_2 = 1 - 0 = 1?

Remember that all those computations are sum squared error, meaning you compute the difference between ground truth and prediction for each element. You are not subtracting across elements. Also, the predictions are floating point, not integer. So if the ground truth for object presence at a specific location is 1.0, the prediction might be 0.7, and the difference, 1.0 - 0.7, is what drives the optimization. In the formula's terms that is C_i - \hat{C}_i for a given i: C_1 - \hat{C}_1, not C_1 - C_2. The class terms work the same way, with p_i(c) - \hat{p}_i(c).

Upper case C in that expression is the confidence score, i.e. object presence for a box. The class probabilities are the p(c) terms.

The p(c) are probabilities, but the ground truth class labels are 1 and 0? So in calculating the squared error loss of the classification, the predictions have to be subtracted from 1 and 0?

Ground truth object presence has a 1.0 where there is actually an object, 0.0 elsewhere. The ground truth class vector has a 1.0 at the index representing the correct class, 0.0 elsewhere. The prediction vectors likely have lots of non-zeros all over the place, hopefully getting closer to the ground truth values over training iterations.

After a few iterations the prediction vector for class might be car=0.75, toaster=0.03, truck=0.2, fish=0.02, etc., with car=0.9, toaster=0.02, truck=0.07, fish=0.01 at the end of training.
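Plugging those two example vectors into a sum squared error, assuming the object really is a car (so the ground truth is one-hot on car), shows the class term shrinking as training goes on. The `sum_squared_error` helper is just for illustration.

```python
def sum_squared_error(truth, pred):
    """Element-wise (truth - prediction)^2, summed over the class vector."""
    return sum((t - p) ** 2 for t, p in zip(truth, pred))

classes = ["car", "toaster", "truck", "fish"]
truth = [1.0, 0.0, 0.0, 0.0]      # one-hot ground truth: it is a car
early = [0.75, 0.03, 0.20, 0.02]  # prediction after a few iterations
late = [0.90, 0.02, 0.07, 0.01]   # prediction at the end of training

early_loss = sum_squared_error(truth, early)
late_loss = sum_squared_error(truth, late)
print(early_loss, late_loss)
```

Every subtraction is truth minus prediction for the same class (car with car, toaster with toaster), never one class against another.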

@ai_curious so in both cases the ground truth values would be either 0 or 1, the predictions would be floats somewhere between 0 and 1, and the loss term would always be (truth - prediction)^2, i.e. an integer (0 or 1) minus a float (e.g. 0.75), squared?

I don’t understand that part. The predicted values are not binary output in the versions of YOLO used in this class.

To the extent that you apply different activations to different segments of the network output, I guess it conceptually resembles the multi-output network depicted in that thread. But YOLO doesn’t use different layers to do that. Rather, the single input image runs through the entire network, then the different parts of the output may have different post-neural network processing applied. In V2, which is what the code in the exercise is based on, center location and object presence are run through sigmoid, shape is passed to exponential, and softmax is applied to the class vector.
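Here is a rough sketch of that post-processing for a single box, in plain Python. The `decode_box` helper and the layout of `raw` are assumptions for illustration and are not the exercise code. It also leaves out the grid cell offset that gets added to the center and the anchor box dimensions that the exponential gets multiplied by.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    m = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def decode_box(raw, num_classes):
    """raw is assumed to be [tx, ty, tw, th, to, class scores...]."""
    tx, ty, tw, th, to = raw[:5]
    return {
        "center": (sigmoid(tx), sigmoid(ty)),    # position inside the cell, in (0, 1)
        "shape": (math.exp(tw), math.exp(th)),   # positive scale factors
        "presence": sigmoid(to),                 # object presence, in (0, 1)
        "class_probs": softmax(raw[5:5 + num_classes]),
    }

# Made-up raw network output for one box with 3 classes.
raw = [0.0, 0.0, 0.0, 0.0, 2.0, 1.0, 3.0, 0.5]
box = decode_box(raw, num_classes=3)
print(box)
```

The network itself just emits the raw numbers. All of the sigmoid / exponential / softmax work happens on slices of that one output.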

I really believe the next step is to study the code. Hard to get any deeper without it. Cheers