Yolo centroids/conv impl

Hi everyone, I have a question about the YOLO model:

If a regular CNN cannot detect multiple objects at once, why can the YOLO algorithm get the centroid of each object present in the image?

Or is it because it computes a kind of “mean” of the coordinates of each object’s borders/edges? And if so, how would it detect those edges?

Additionally, just to clarify: does the convolutional implementation of the sliding windows technique basically shrink down the most important content of the image, so that the CNN runs directly over those few resulting pixels and then we get the inference? Or am I misunderstanding something?

thanks a lot!

Hi @pablocpz.ai

YOLO is different from CNN-based object detection methods because it divides the input image into a grid and predicts bounding boxes and class probabilities directly from each grid cell.
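As a rough sketch of what that grid output looks like as a tensor (the numbers here, a 19×19 grid with 5 anchors and 80 classes, are illustrative and not necessarily the ones used in the course):

```python
import numpy as np

# Illustrative sizes (assumed, not from the course):
S, B, C = 19, 5, 80   # grid size, anchors per cell, number of classes

# YOLO's head outputs one prediction vector per (cell, anchor):
# [tx, ty, tw, th, confidence, class score 1, ..., class score C]
pred = np.zeros((S, S, B, 5 + C))

print(pred.shape)   # (19, 19, 5, 85)
print(pred.size)    # total numbers predicted in one forward pass
```

So every grid cell carries B full box predictions with class probabilities, all produced in a single forward pass.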

You’re correct! But it’s not just about shrinking down the content: it also helps the network focus on important regions of the image (it captures local features) and reduces the computational cost.
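To make the sliding-windows point concrete, here is a toy NumPy sketch. The single 14×14 filter stands in for a whole classifier (in a real conversion, the FC layers become 1×1 convolutions); the point is only that one “convolutional” pass over a larger image produces, at each output position, the same value as running the classifier on the corresponding crop:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "classifier": one 14x14 filter (stands in for a full network).
W = rng.standard_normal((14, 14))

def run_on_window(window):
    """Run the classifier on a single 14x14 crop."""
    return float((window * W).sum())

def run_convolutionally(image):
    """One pass over the whole image: each output position corresponds
    to one sliding-window crop of the input."""
    H, Wd = image.shape
    out = np.empty((H - 13, Wd - 13))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (image[i:i+14, j:j+14] * W).sum()
    return out

image = rng.standard_normal((16, 16))
dense = run_convolutionally(image)   # shape (3, 3): 9 windows in one pass
print(np.isclose(dense[1, 2], run_on_window(image[1:15, 2:16])))  # True
```

In a real network the shared convolutional features make this dramatically cheaper than cropping and re-running the classifier window by window.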

OK, thanks for your reply! But then, how does the network know that a given cell contains the center of an object, i.e., the point where we will say the object is?

thanks in advance!!

Nobody talks about how YOLO knows where the centroid is, haha.

Each grid cell predicts bounding boxes whose coordinates are expressed relative to that grid cell. For each bounding box, the model outputs a confidence score along with class probabilities.
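Here is a hedged sketch of YOLOv2-style box decoding (the names `tx, ty, tw, th` and the exact normalization are my assumptions about the parameterization). The sigmoid pins the predicted center inside the responsible cell, which is precisely how the model encodes “this cell owns the centroid”:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decode_box(tx, ty, tw, th, cell_col, cell_row, anchor_w, anchor_h, S):
    """Decode raw network outputs for one (cell, anchor) into image coords.
    Names and normalization are illustrative, not the course's exact code."""
    # sigmoid() keeps the offset in (0, 1), so the center stays in its cell.
    bx = (cell_col + sigmoid(tx)) / S   # center x, as a fraction of image width
    by = (cell_row + sigmoid(ty)) / S   # center y, as a fraction of image height
    # Width/height rescale the matched anchor box (exp keeps them positive).
    bw = anchor_w * np.exp(tw)
    bh = anchor_h * np.exp(th)
    return bx, by, bw, bh

# Raw outputs of 0 put the center exactly in the middle of cell (3, 4)
# of a 19x19 grid, with the box shape equal to the anchor shape:
bx, by, bw, bh = decode_box(0.0, 0.0, 0.0, 0.0, cell_col=3, cell_row=4,
                            anchor_w=0.1, anchor_h=0.2, S=19)
print(bx, by, bw, bh)
```

So the centroid is not computed from edges at all: the network directly regresses a center offset per cell, trained against labels where each ground-truth center is assigned to exactly one cell.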

Here’s another recent thread with some relevant discussion.

So each cell cannot detect two objects from different classes, can it? Is that why a finer grid is used?

@pablocpz.ai Can you provide an example of what you are thinking about here?

I mean, perhaps if you are speaking about a picture of a person wearing a shirt, or a picture of a car, and your goal is to detect person/car-- So you are dealing with a subset of a greater set.

But otherwise I can’t think how you’d possibly have two classes in the same cell (?)

A person standing in front of a car is a classic example. That’s exactly what anchor boxes allow. First, B predictions can be made per cell at all. Second, the different shapes of the anchor boxes help the algorithm both learn and predict wider-than-tall versus taller-than-wide objects efficiently.
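Here is a small sketch of the shape-based matching idea (the anchor sizes and helper names are made up for illustration): during training, a ground-truth box is typically assigned to the anchor whose width/height shape it overlaps best, as if both boxes were centered at the same point:

```python
def shape_iou(w1, h1, w2, h2):
    """IoU of two boxes compared by shape only (both centered at origin)."""
    inter = min(w1, w2) * min(h1, h2)
    return inter / (w1 * h1 + w2 * h2 - inter)

# Two illustrative anchors (widths/heights as fractions of the image):
anchors = [(0.10, 0.30),   # tall-and-narrow (person-like)
           (0.40, 0.15)]   # wide-and-flat   (car-like)

gt_w, gt_h = 0.12, 0.35    # a roughly person-shaped ground-truth box
best = max(range(len(anchors)),
           key=lambda i: shape_iou(gt_w, gt_h, *anchors[i]))
print(best)  # 0 -> the tall anchor "owns" this object in its cell
```

That is how a person and a car whose centers fall in the same cell can end up in different anchor slots, so both get predicted.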

If you check my responses in the related thread linked above you’ll see that I have a different take on the ‘YOLO divides the image into grid cells’ meme. Give it a read and let us know what you think?

NOTE
In YOLO v1 there were B=2 detections per grid cell. In v2 as used in the exercise in this course, B=5 if I recall correctly. So 5 object predictions per grid cell.

I’m not sure if I fully understand your question, but each bounding box shows a single object while a grid cell can detect multiple objects.

So each grid cell predicts two objects (bboxes), but only the one with the higher confidence is kept, so it only outputs one bbox?

Furthermore, if non-max suppression ensures that we don't have multiple predictions for the same object, which means the object will be detected at the cell where its centroid is located, how are the bboxes obtained if the object occupies more than the cell itself?

Not necessarily. 2 is not a magic number in the YOLO version taught in this class, rather it depends on how many anchor boxes are being used.

Again, not necessarily. If multiple predicted bounding boxes have high confidence but low IOU with one another, they can all be kept.
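A minimal greedy NMS sketch to illustrate that point (the threshold and helper names are illustrative): boxes with high mutual IoU are treated as duplicates of the strongest one, while high-confidence boxes that barely overlap all survive:

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap
    it heavily, repeat with what's left."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

# Boxes 0 and 1 are near-duplicates of one object; box 2 is a second object.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.85]
print(nms(boxes, scores))  # [0, 2] -> the duplicate box 1 is suppressed
```

Note that NMS runs per class on the decoded boxes; it doesn't care which cells they came from.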


@Alireza_Saei-- I was just looking back at your earlier post, and was like, wait a minute:

Are you saying YOLO is not a ConvNet model (i.e., say, during training)?!? Or do you mean rather that it is different from your plain vanilla ConvNet?

I think if YOLO did in fact divide the input image into grid cell sized subsets, it would be very difficult for a predicted bounding box to be larger than the grid cell. However, YOLO does not actually divide the input image at all, so this is a non-problem.
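A quick numeric illustration of why that matters (all numbers assumed, using the YOLOv2-style width decoding `bw = anchor_w * exp(tw)` from the literature): the decoded box size depends on the anchor and the raw network output, not on the cell size, so a box can easily span many cells:

```python
import math

S = 19
cell_span = 1 / S            # one cell covers ~0.053 of the image width

# Assumed anchor width and raw output for one prediction:
anchor_w, tw = 0.2, 0.5
bw = anchor_w * math.exp(tw)  # decoded box width, fraction of image width

print(bw > cell_span)  # True: this box is several cells wide
```

The grid only decides *which* cell is responsible for predicting the object (the one containing its center); it never clips or crops the box.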

The actual mechanism by which bounding box shape is predicted, and its relation to grid cell and anchor box size, is covered in detail in existing threads. You can find them by advanced search using anchor box and my username. HTH

https://community.deeplearning.ai/search?context=topic&context_id=625730&q=%40ai_curious%20%22anchor%20box%22&skip_context=true

Hi @Nevermnd ,

Thanks for asking! I meant that YOLO follows a different approach compared to traditional ConvNet. YOLO is indeed a ConvNet.