Yolo centroids/conv impl

Hi everyone, I have a question about the YOLO model:

If a regular CNN cannot detect multiple objects at once, why can the YOLO algorithm get the centroid of each object present in the image?

Or is it because it computes a kind of “mean” of the coordinates of each object’s borders/edges? And if so, how would it detect those edges?

Additionally, just to clarify: does the convolutional implementation of the sliding windows technique basically shrink down the most important content of the image, so that the CNN runs directly over those few resulting pixels and then we get the inference? Or am I misunderstanding something?

thanks a lot!

Hi @pablocpz.ai

YOLO is different from CNN-based object detection methods because it divides the input image into a grid and predicts bounding boxes and class probabilities directly from each grid cell.
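As a rough sketch of what that grid output looks like as a tensor (the numbers here, a 19×19 grid with 5 anchors and 80 classes, are illustrative and not necessarily the ones used in the course):

```python
import numpy as np

# Illustrative sizes (assumed, not from the course):
S, B, C = 19, 5, 80   # grid size, anchors per cell, number of classes

# YOLO's head outputs one prediction vector per (cell, anchor):
# [tx, ty, tw, th, confidence, class score 1, ..., class score C]
pred = np.zeros((S, S, B, 5 + C))

print(pred.shape)   # (19, 19, 5, 85)
print(pred.size)    # total numbers predicted in one forward pass
```

So every grid cell carries B full box predictions with class probabilities, all produced in a single forward pass.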

You’re correct! But it’s not just about shrinking down the content: it also helps the network focus on important regions of the image (it captures local features) and reduces the computational cost.
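To make the sliding-windows point concrete, here is a toy NumPy sketch. The single 14×14 filter stands in for a whole classifier (in a real conversion, the FC layers become 1×1 convolutions); the point is only that one “convolutional” pass over a larger image produces, at each output position, the same value as running the classifier on the corresponding crop:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "classifier": one 14x14 filter (stands in for a full network).
W = rng.standard_normal((14, 14))

def run_on_window(window):
    """Run the classifier on a single 14x14 crop."""
    return float((window * W).sum())

def run_convolutionally(image):
    """One pass over the whole image: each output position corresponds
    to one sliding-window crop of the input."""
    H, Wd = image.shape
    out = np.empty((H - 13, Wd - 13))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (image[i:i+14, j:j+14] * W).sum()
    return out

image = rng.standard_normal((16, 16))
dense = run_convolutionally(image)   # shape (3, 3): 9 windows in one pass
print(np.isclose(dense[1, 2], run_on_window(image[1:15, 2:16])))  # True
```

In a real network the shared convolutional features make this dramatically cheaper than cropping and re-running the classifier window by window.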

OK, thanks for your reply! But then, how does the network know that a given cell contains the center of an object, i.e., the point where we will say the object is?

thanks in advance!!

Nobody talks about how YOLO knows where the centroid is, haha.

Each grid cell predicts bounding boxes whose coordinates are expressed relative to that grid cell. For each bounding box, the model outputs a confidence score along with class probabilities.
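Here is a hedged sketch of YOLOv2-style box decoding (the names `tx, ty, tw, th` and the exact normalization are my assumptions about the parameterization). The sigmoid pins the predicted center inside the responsible cell, which is precisely how the model encodes “this cell owns the centroid”:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decode_box(tx, ty, tw, th, cell_col, cell_row, anchor_w, anchor_h, S):
    """Decode raw network outputs for one (cell, anchor) into image coords.
    Names and normalization are illustrative, not the course's exact code."""
    # sigmoid() keeps the offset in (0, 1), so the center stays in its cell.
    bx = (cell_col + sigmoid(tx)) / S   # center x, as a fraction of image width
    by = (cell_row + sigmoid(ty)) / S   # center y, as a fraction of image height
    # Width/height rescale the matched anchor box (exp keeps them positive).
    bw = anchor_w * np.exp(tw)
    bh = anchor_h * np.exp(th)
    return bx, by, bw, bh

# Raw outputs of 0 put the center exactly in the middle of cell (3, 4)
# of a 19x19 grid, with the box shape equal to the anchor shape:
bx, by, bw, bh = decode_box(0.0, 0.0, 0.0, 0.0, cell_col=3, cell_row=4,
                            anchor_w=0.1, anchor_h=0.2, S=19)
print(bx, by, bw, bh)
```

So the centroid is not computed from edges at all: the network directly regresses a center offset per cell, trained against labels where each ground-truth center is assigned to exactly one cell.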

Here’s another recent thread with some relevant discussion.

So each cell cannot detect two objects from different classes, can it? Is that why a finer grid is used?

@pablocpz.ai Can you provide an example of what you are thinking about here?

I mean, perhaps if you are speaking about a picture of a person wearing a shirt, or a picture of a car, and your goal is to detect person/car-- So you are dealing with a subset of a greater set.

But otherwise I can’t think how you’d possibly have two classes in the same cell (?)

A person standing in front of a car is a classic example. That’s exactly what anchor boxes allow. First, B predictions can be made per cell at all. Second, the different shapes of the anchor boxes help the algorithm both learn and predict wider-than-tall versus taller-than-wide objects efficiently.
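Here is a small sketch of the shape-based matching idea (the anchor sizes and helper names are made up for illustration): during training, a ground-truth box is typically assigned to the anchor whose width/height shape it overlaps best, as if both boxes were centered at the same point:

```python
def shape_iou(w1, h1, w2, h2):
    """IoU of two boxes compared by shape only (both centered at origin)."""
    inter = min(w1, w2) * min(h1, h2)
    return inter / (w1 * h1 + w2 * h2 - inter)

# Two illustrative anchors (widths/heights as fractions of the image):
anchors = [(0.10, 0.30),   # tall-and-narrow (person-like)
           (0.40, 0.15)]   # wide-and-flat   (car-like)

gt_w, gt_h = 0.12, 0.35    # a roughly person-shaped ground-truth box
best = max(range(len(anchors)),
           key=lambda i: shape_iou(gt_w, gt_h, *anchors[i]))
print(best)  # 0 -> the tall anchor "owns" this object in its cell
```

That is how a person and a car whose centers fall in the same cell can end up in different anchor slots, so both get predicted.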

If you check my responses in the related thread linked above you’ll see that I have a different take on the ‘YOLO divides the image into grid cells’ meme. Give it a read and let us know what you think?

NOTE
In YOLO v1 there were B=2 detections per grid cell. In v2 as used in the exercise in this course, B=5 if I recall correctly. So 5 object predictions per grid cell.

I’m not sure if I fully understand your question, but each bounding box shows a single object while a grid cell can detect multiple objects.

So each grid cell predicts two objects (bboxes), but only the one with the higher confidence is kept, so it only outputs one bbox?

Furthermore, if non-max suppression ensures that we don't have multiple predictions for the same object, which means the object will be detected at the cell where its centroid is located, how are the bboxes obtained if the object occupies more than the cell itself?

Not necessarily. 2 is not a magic number in the YOLO version taught in this class, rather it depends on how many anchor boxes are being used.

Again, not necessarily. If multiple predicted bounding boxes have high confidence but low IOU with one another, they can all be kept.
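A minimal greedy NMS sketch to illustrate that point (the threshold and helper names are illustrative): boxes with high mutual IoU are treated as duplicates of the strongest one, while high-confidence boxes that barely overlap all survive:

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap
    it heavily, repeat with what's left."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

# Boxes 0 and 1 are near-duplicates of one object; box 2 is a second object.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.85]
print(nms(boxes, scores))  # [0, 2] -> the duplicate box 1 is suppressed
```

Note that NMS runs per class on the decoded boxes; it doesn't care which cells they came from.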


@Alireza_Saei-- I was just looking back at your earlier post, and was like, wait a minute:

Are you saying YOLO is not a ConvNet model (i.e., say, during training)?!? Or do you mean rather that it is different from your plain vanilla ConvNet?

I think if YOLO did in fact divide the input image into grid cell sized subsets, it would be very difficult for a predicted bounding box to be larger than the grid cell. However, YOLO does not actually divide the input image at all, so this is a non-problem.
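A quick numeric illustration of why that matters (all numbers assumed, using the YOLOv2-style width decoding `bw = anchor_w * exp(tw)` from the literature): the decoded box size depends on the anchor and the raw network output, not on the cell size, so a box can easily span many cells:

```python
import math

S = 19
cell_span = 1 / S            # one cell covers ~0.053 of the image width

# Assumed anchor width and raw output for one prediction:
anchor_w, tw = 0.2, 0.5
bw = anchor_w * math.exp(tw)  # decoded box width, fraction of image width

print(bw > cell_span)  # True: this box is several cells wide
```

The grid only decides *which* cell is responsible for predicting the object (the one containing its center); it never clips or crops the box.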

The actual mechanism by which bounding box shape is predicted, and its relation to grid cell and anchor box size, is covered in detail in existing threads. You can find them by advanced search using anchor box and my username. HTH

https://community.deeplearning.ai/search?context=topic&context_id=625730&q=%40ai_curious%20%22anchor%20box%22&skip_context=true

Hi @Nevermnd ,

Thanks for asking! I meant that YOLO follows a different approach compared to traditional ConvNet. YOLO is indeed a ConvNet.