Is YOLO a regression or classification algorithm?

While working through the assignments, I am not clear about the algorithm’s type. The output includes:

out_scores, out_boxes, out_classes

out_boxes are real-valued coordinates and out_classes are classification labels. So is this both a regression and a classification algorithm? How is the loss calculated? This seems very different from what we have learnt so far.

This is because neural networks include both regression and classification within them, i.e. wx+b can be used for linear regression and g(wx+b) can be used for classification (where g is an activation).

From the first YOLO paper…

We reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities. Using our system, you only look once (YOLO) at an image to predict what objects are present and where they are.

It seems different because it was different, which is why it was such a big deal when it was published in 2015/2016. The network outputs a couple hundred thousand floating point numbers (the exact number depends on grid size and number of anchor boxes). So that part is entirely regression. Post-CNN processing imputes meaning: e.g. these two numbers mean center location, these two numbers mean bounding box shape, this number is the index into the class list, etc.
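To make “a big block of floats plus post-processing” concrete, here is a hypothetical numpy sketch of that slicing. The 19×19 grid, 5 anchor boxes, and 80 classes are assumed V2-style numbers, not something fixed by the architecture:

```python
import numpy as np

S, B, NUM_CLASSES = 19, 5, 80   # assumed V2-style grid size, anchors, classes

# The network emits one flat vector of floats per image: pure regression.
raw = np.random.randn(S * S * B * (5 + NUM_CLASSES)).astype(np.float32)
print(raw.size)                 # 19 * 19 * 5 * 85 = 153425 numbers

# Post-processing imputes meaning by reshaping and slicing:
out = raw.reshape(S, S, B, 5 + NUM_CLASSES)
xy      = out[..., 0:2]         # these two numbers mean center location
wh      = out[..., 2:4]         # these two numbers mean bounding box shape
conf    = out[..., 4]           # this number means object presence
classes = out[..., 5:]          # these scores index into the class list
```

Nothing in the network itself distinguishes these segments; only the slicing convention applied afterward does.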

The loss function has evolved a little over time. You most commonly find the picture from the V1 paper, which is this:

$$
\begin{aligned}
\lambda_{coord} &\sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}^{obj}_{ij}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
+\ \lambda_{coord} &\sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}^{obj}_{ij}\left[(\sqrt{w_i}-\sqrt{\hat{w}_i})^2+(\sqrt{h_i}-\sqrt{\hat{h}_i})^2\right] \\
+\ &\sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}^{obj}_{ij}\,(C_i-\hat{C}_i)^2 \\
+\ \lambda_{noobj} &\sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}^{noobj}_{ij}\,(C_i-\hat{C}_i)^2 \\
+\ &\sum_{i=0}^{S^2} \mathbb{1}^{obj}_{i} \sum_{c\,\in\,classes} (p_i(c)-\hat{p}_i(c))^2
\end{aligned}
$$

x and y are the object center coordinates, w and h are the bounding box shape, C is the confidence (object presence) prediction, and p_i(c) are the class probabilities. \lambda_{coord} and \lambda_{noobj} are weightable separate factors for the location/shape error versus the no-object confidence error. S is the grid size (so S^2 cells), B is the bounding box count per cell, and \mathbb{1}^{obj}_{ij} picks out the box responsible for an object. Non-trivial.
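As a rough sanity check of how the pieces fit together, here is a minimal numpy sketch of that sum-squared-error loss. The feature layout, the simplified responsibility mask (taken straight from the ground-truth confidence rather than from best IOU), and the toy shapes are all assumptions for illustration, not the paper’s exact implementation:

```python
import numpy as np

def yolo_v1_loss(pred, truth, lambda_coord=5.0, lambda_noobj=0.5):
    """Sketch of the V1 sum-squared-error loss (simplified: the per-box
    responsibility mask comes straight from the ground-truth confidence).
    Assumed feature layout: [x, y, w, h, C, p(c)...]; shape (S, S, B, 5+classes)."""
    obj   = truth[..., 4]          # 1.0 where a box is responsible for an object
    noobj = 1.0 - obj

    xy_err    = np.sum(obj * np.sum((truth[..., 0:2] - pred[..., 0:2]) ** 2, axis=-1))
    wh_err    = np.sum(obj * np.sum((np.sqrt(truth[..., 2:4])
                                     - np.sqrt(np.abs(pred[..., 2:4]))) ** 2, axis=-1))
    conf_err  = np.sum(obj * (truth[..., 4] - pred[..., 4]) ** 2)
    noobj_err = np.sum(noobj * (truth[..., 4] - pred[..., 4]) ** 2)
    class_err = np.sum(obj[..., None] * (truth[..., 5:] - pred[..., 5:]) ** 2)

    return (lambda_coord * (xy_err + wh_err)
            + conf_err + lambda_noobj * noobj_err + class_err)

# Toy 2x2 grid, 1 box per cell, 4 classes: a perfect prediction gives zero loss
truth = np.zeros((2, 2, 1, 9))
truth[0, 0, 0] = [0.5, 0.5, 0.4, 0.6, 1.0, 0.0, 1.0, 0.0, 0.0]
print(yolo_v1_loss(truth.copy(), truth))   # 0.0
```

Every term is a squared difference between a ground-truth float and a predicted float, just weighted and masked differently.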


Interesting! Sorry, I should just go read the paper, but I’ll be lazy and take advantage of your expertise:

Given that the loss is a (weighted) sum of the losses on the different parts of the predictions, it seems like they could just as easily have treated the object classification term in the usual “softmax” way and used cross-entropy loss for that term. The gradients are nice and clean and you’re just adding them up in any case. Did they comment in the paper on whether they considered that, and why they went the way they did? Of course squared error is a lot cheaper to compute than logarithms, but in training it’s the gradients you’re computing, not the actual cost values. And of course the derivative of log is a lot cheaper to compute than the logarithm itself.

Just curious. Obviously the method they chose worked out well, so “the pudding tastes great” is a perfectly valid answer. :smile:

It’s always risky to write ‘this is how YOLO does it’ because YOLO changed over time. The V1 paper offers only “We optimize for sum-squared error in the output of our model … because it is easy to optimize”. By V3, however, while location and shape still use sum of squared error loss, it switches to binary cross-entropy loss for the class predictions. As if that weren’t complex enough, V3 also supports multi-label classification (e.g. German Shepherd and Dog) and predicts bounding boxes at 3 different scales.
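To illustrate the V3-style class term, here is a minimal numpy sketch of binary cross-entropy over independent per-class scores; the class names and numbers are invented for illustration:

```python
import numpy as np

def bce(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy averaged over class scores.
    Each class is an independent sigmoid output, so one box can be both
    'dog' AND 'german_shepherd' (multi-label), unlike a softmax."""
    y_pred = np.clip(y_pred, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Multi-label ground truth: [dog, german_shepherd, cat, car]
y_true = np.array([1.0, 1.0, 0.0, 0.0])
y_pred = np.array([0.9, 0.8, 0.1, 0.05])
print(bce(y_true, y_pred))   # small positive number, shrinking as predictions improve
```

Note that two targets are 1 at once, which a softmax (probabilities summing to 1) cannot represent cleanly.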

V4, the last version from the main YOLO family tree, introduces many more architectural complexities, and the branches from the PyTorch / ultralytics groups, which some people argue shouldn’t even be named YOLO, differ even more. For the purposes of this forum, I try to stick with V1 or V2, which is what the lectures and the self-driving car lab were originally based on.

So in V1 the whole thing is treated as a regression problem, including the object classification part. In the formula, p_i(c) is the probability that label c exists. What about the upper case C_i?

I am also curious: for C, are the values all just ‘1’ or ‘0’? For example, C_1 - C_2 = 1 - 0 = 1?

Remember, all those computations are related to sum squared error. That means you compute the difference between ground truth and prediction for each element; you are not subtracting across elements. Also, the predictions are floating point, not integer. So if the ground truth confidence that an object is present at a specific location is 1.0, the prediction might be 0.7, and that difference, 1.0 - 0.7, is what drives the optimization. That would be something like C_i - \hat{C}_i for a given i, and likewise p_i(c) - \hat{p}_i(c) for each class entry. It’s C_1 - \hat{C}_1, not C_1 - C_2.
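A quick numeric sketch of that element-wise subtraction, using the hypothetical 1.0-vs-0.7 numbers from above:

```python
import numpy as np

# Ground truth minus prediction at the SAME index, never element 0 minus element 1.
truth = np.array([1.0, 0.0, 0.0])   # e.g. object known present at location 0
pred  = np.array([0.7, 0.2, 0.1])   # floating-point predictions

diff = truth - pred                 # element i of truth minus element i of pred
print(np.sum(diff ** 2))            # sum squared error, approx 0.14
```

Each position contributes its own squared difference; no cross-element subtraction ever happens.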

Upper case C in that expression is the confidence that an object is present (the V1 paper defines it as Pr(Object) × IOU).

What’s C_i - \hat{C}_i?

p(c) values are probabilities, but the ground truth class labels are 1s and 0s? So calculating the squared loss of the classification involves subtracting predictions from 1s and 0s?

Ground truth confidence C_i is 1.0 where there is actually an object, 0.0 elsewhere. Ground truth class probability p_i(c) is 1.0 at the index representing the correct class, 0.0 elsewhere. The prediction vectors, \hat{C} and \hat{p}, likely have lots of non-zero values all over the place, hopefully getting closer to the ground truth over training iterations.

After a few iterations the prediction vector for class might be car = 0.75, toaster = 0.03, truck = 0.20, fish = 0.02, etc., ending up at something like car = 0.90, toaster = 0.02, truck = 0.07, fish = 0.01 by the end of training.
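Plugging those example vectors into the sum-squared class error shows the number the optimizer is pushing down (toy numbers from the paragraph above, not actual training output):

```python
import numpy as np

# classes: [car, toaster, truck, fish]; ground truth is 'car'
truth = np.array([1.0, 0.0, 0.0, 0.0])
early = np.array([0.75, 0.03, 0.20, 0.02])   # after a few iterations
late  = np.array([0.90, 0.02, 0.07, 0.01])   # at the end of training

print(np.sum((truth - early) ** 2))   # approx 0.1038
print(np.sum((truth - late) ** 2))    # approx 0.0154
```

The error shrinks as the prediction vector drifts toward the one-hot ground truth.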

@ai_curious so in both cases the ground truth values would be either truth = (0 or 1), and predictions would be pred= (0,1), and the loss subtraction would always be (truth-prediction)^2, i.e. an integer (0 or 1) minus a float (i.e. 0.75)?

Another question: is the idea of this algorithm similar to a multi-input multi-output architecture? Here is one example: python - Multi-input Multi-output Model with Keras Functional API - Stack Overflow. The difference being that YOLO uses a single-input, single-output architecture?

I don’t understand that part. The predicted values are not binary outputs in the versions of YOLO used in this class.

To the extent that you apply different activations to different segments of the network output, I guess it conceptually resembles the multi-output network depicted in that thread. But YOLO doesn’t use different layers to do that. Rather, the single input image runs through the entire network, and then the different parts of the output may have different post-network processing applied. In V2, which is what the code in the exercise is based on, center location and object presence are run through a sigmoid, shape is passed through an exponential, and softmax is applied to the class vector.
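A minimal sketch of that per-segment post-processing, assuming the usual 85-float layout per anchor box (2 center + 2 shape + 1 presence + 80 classes); this is an illustration, not the exact exercise code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))      # subtract max for numerical stability
    return e / e.sum()

raw = np.random.randn(85)          # one anchor box's slice of the network output

xy       = sigmoid(raw[0:2])       # center location -> sigmoid (stays inside the cell)
wh       = np.exp(raw[2:4])        # shape -> exponential (positive scale on the anchor)
presence = sigmoid(raw[4])         # object presence -> sigmoid
probs    = softmax(raw[5:])        # class vector -> softmax (sums to 1)
```

Same raw floats, different activation per segment, all applied after the single network pass.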

I really believe the next step is to study the code. Hard to get any deeper without it. Cheers

By “pred= (0,1)”, I meant a float between 0 < x < 1. Sorry for the confusion.