Questions about YOLO

So I’m going through the C4W3 course (object detection with CNNs), and I still cannot get a good enough intuition for why algorithms such as YOLO work.

If I understand correctly, we do not use literal sliding windows because they are very computationally expensive, so we use an alternative solution based on convolutional layers.

So when it comes to the YOLO algorithm, for example, we use a grid to split the input image, so y_hat at the end has the same height and width as the grid.
It looks like in this case we use a stride that’s equal to the filter size, in the sense that the elements of the grid do not overlap with each other. Why is that so? Is it not necessary to consider the overlapping cases?

My second question is this: say we have an image in which the object is really close, so each grid cell may only see a tiny fraction of the object. In this case, how can the NN know what that position contains? Furthermore, how can it predict the total object size (box size) based on only a fraction of it?

In the case of multi-object detection (within one grid cell), how can the NN predict the center point (in the course it is noted b_x, b_y) of object 2 if it is behind object 1?



Hi, I struggled with questions similar to the ones you ask, and I hope I can help you reach a satisfying intuition. Before I answer, I want to make sure you understand a few things; if you already do, that’s great!

Firstly, we need to understand that the “grid” really only exists and is used during training, i.e. there is no actual grid defined in the convolutional network architecture of YOLO. The grid is just used by us humans to provide a simple, intuitive way to label objects. Here’s how you can think about it.
Imagine you are an ML engineer who wants to teach a CNN to classify different objects with bounding boxes. You know sliding windows are computationally expensive, so you come up with the idea that you want one look, one forward pass of the CNN, to classify all objects. But how would you even train this network? You clearly need an output that labels all objects with bounding boxes as consistently as possible. So you decide it is a good idea to create an s by s grid, and to assign each object only to the grid cell where its midpoint lies. That saves you a lot of work. You also want a bounding box, so you make the cell output that as well.
And then you realize: since I am using an s by s grid, and each cell may or may not contain an object, my imagined one-look CNN must have an output of s by s by (things you want one cell to output), i.e. s by s by y, where y = 4 bounding box coordinates + number of classes + 1 for “is there anything here at all” (I am ignoring anchors and other complications for the moment).
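To make that labeling scheme concrete, here is a minimal sketch of how such an s by s by (5 + C) training label could be built. The helper name is hypothetical and anchors are ignored, exactly as in the paragraph above:

```python
import numpy as np

def make_label_tensor(objects, S=7, C=20):
    """Build a YOLO-style training label of shape (S, S, 5 + C).

    `objects` is a list of (x, y, w, h, class_id) with image-relative
    coordinates in [0, 1). Each object is assigned only to the grid cell
    that contains its midpoint. Illustrative sketch; anchors are ignored.
    """
    label = np.zeros((S, S, 5 + C))
    for (x, y, w, h, cls) in objects:
        col, row = int(x * S), int(y * S)   # cell containing the midpoint
        label[row, col, 0] = 1.0            # objectness: something is here
        label[row, col, 1:5] = [x, y, w, h] # bounding box
        label[row, col, 5 + cls] = 1.0      # one-hot class
    return label

label = make_label_tensor([(0.5, 0.5, 0.2, 0.3, 3)], S=7, C=20)
print(label.shape)     # (7, 7, 25)
print(label[3, 3, 0])  # 1.0 -- midpoint (0.5, 0.5) falls in cell (3, 3)
```

Only the cell that contains the midpoint gets a non-zero label; every other cell says “nothing here,” which is exactly the consistency the one-look network needs.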

And that’s it! That’s YOLO. Overly simplified, but that is the intuition. So to recap: the grid is just a scheme to label our output consistently and make it easy for the network to learn. What is the significance of the grid in the actual CNN of YOLO, you ask? It is just that we end up with an output of s by s by (predictions per cell), so we can draw an analogy of the input image having been condensed into a grid of probability distributions.

Now coming to your question 1.

I can see why you would think that if we want to condense an image to an s by s grid, we would just use a stride equal to the filter size, but as you can see, that is not at all what we actually do in the model. You are right: if we did that, it would cause loss of information, and it is necessary to consider the overlapping cases.
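You can see the difference directly from the standard convolution output-size formula, floor((n + 2p − f)/s) + 1. With stride equal to the filter size you get non-overlapping patches; with stride 1 (as in ordinary conv layers) the windows overlap heavily. The numbers below are illustrative, not YOLO’s actual layer sizes:

```python
def conv_output_size(n, f, stride, pad=0):
    """Standard convolution output size: floor((n + 2*pad - f) / stride) + 1."""
    return (n + 2 * pad - f) // stride + 1

# A 448-pixel-wide input with a 7x7 filter:
print(conv_output_size(448, 7, stride=7))  # 64  -- non-overlapping "grid" patches
print(conv_output_size(448, 7, stride=1))  # 442 -- overlapping windows, as in conv layers
```

The network condenses the image down to s by s gradually, through many overlapping convolutions and pooling steps, rather than in one non-overlapping chop.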

Coming to question 2:

If you read my intro paragraph, I hope you can see that each grid cell does not make a prediction based only on the pixels of that grid cell. The CNN is fed the WHOLE image; it is not the case that each cell analyzes only its specific part of the image.
So even if an object barely appears in a cell, the CNN is not making its prediction by looking at just that portion. I encourage you to look up the “receptive field”. Basically, what this means is: take a network with a 7 by 7 grid output and consider the output vector for the first cell of that 7 by 7 grid. That is the CNN’s output for the possible detections/bounding boxes assigned to the first cell, but it has inputs that trace all the way back through the network to the whole image. Hence, it is possible for this architecture to learn to predict bounding boxes even if only a small part of an object appears in the cell.
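The receptive-field growth can be sketched with the standard recurrence rf_l = rf_{l-1} + (f_l − 1)·j_{l-1}, where j is the product of the strides of all earlier layers. The layer stack below is a toy example, not YOLO’s real architecture:

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of one output unit of a layer stack.

    `layers` is a list of (filter_size, stride) pairs, applied in order.
    """
    rf, jump = 1, 1
    for f, s in layers:
        rf += (f - 1) * jump  # each layer widens the field by (f-1) * earlier strides
        jump *= s
    return rf

# Hypothetical stack: 3x3 convs with 2x2 max-pools in between.
stack = [(3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]
print(receptive_field(stack))  # 18 -- one output unit already "sees" an 18x18 patch
```

With the dozens of layers in a real YOLO network, each cell of the final s by s output has a receptive field covering essentially the entire input image, which is why a cell can predict a full box from a sliver of an object.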

That said, it is very unlikely such a cell will produce an output that survives non-max suppression, since there will almost always be some other cell covering a greater portion of the object for which the CNN outputs a higher probability.

Question 3:
The NN doesn’t exactly predict midpoints in the sense that a midpoint is not a raw output of YOLO; midpoints are used to train the network. As a human, it makes sense to say that YOLO is in fact learning where midpoints might be and giving those cells a higher probability output.
And if an object is partly covered, that’s okay: YOLO isn’t calculating midpoints, it’s just a network that learns to give predictions, so it will naturally predict a close-enough midpoint. (This is what we assume the network is doing.)


Thank you very much for the detailed answer !

Yes, I guess I was too focused on the “Convolutional Implementation of Sliding Windows” video, which deals with only one conv layer.
But indeed the YOLO algorithm has many more layers, and so the grid is just a symbolic representation in the input space.


Anytime! Glad I could clear it up.


Edit: Oh, never mind, I figured it out. They were converted by yolo_head. I had to look at the code in Keras.

May I ask more about predicted midpoints? Since we supervise the network to produce an output midpoint as a coordinate relative to its grid cell during training, and not relative to the whole image, its value should then be ([0,1], [0,1]) within the cell.

What I don’t understand is that when we feed these output values into tf.image.non_max_suppression, we feed them in a flattened shape, so the midpoint coordinates can no longer be referred back to their corresponding grid cells. So how does the non-max suppression algorithm know which midpoint coordinate belongs to which cell, so that it can calculate IoU? Unless they were manually converted somehow (which was not discussed in the course), I suspect the output midpoint coordinates are actually relative to the whole image, but how does YOLO magically know that, if we trained it with midpoint coordinates relative to their corresponding cells??


The network output shape provides the ‘magic.’ The network output is S*S*B*(1+4+C) meaning each cell in the S*S*B part makes (1+4+C) predictions. Let’s look at the shape part first.

B is the number of anchor boxes (priors, or “dimension clusters” in Redmon-ese) that form the basis of the bounding box shape predictions. The network directly outputs values t_w and t_h that are related to the bounding box width and height b_w and b_h like this:

b_w = p_w * e^{t_w}
b_h = p_h * e^{t_h}

In English, the bounding box values are a multiple of the prior box values (p is for prior here). The multiplicative factor is the exponential of the network’s two shape outputs for that anchor box location.
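That decode is a one-liner; the helper name below is hypothetical:

```python
import math

def decode_wh(t_w, t_h, p_w, p_h):
    """Bounding-box width/height from raw network outputs and prior (anchor) dims:
    b_w = p_w * e^{t_w},  b_h = p_h * e^{t_h}."""
    return p_w * math.exp(t_w), p_h * math.exp(t_h)

# t = 0 means "exactly the prior box"; positive t grows it, negative t shrinks it.
print(decode_wh(0.0, 0.0, p_w=2.0, p_h=3.0))  # (2.0, 3.0)
```

Note the exponential also guarantees the predicted width and height are always positive, whatever raw value the network emits.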

Rather similar idea for the center location. The network outputs two values t_x and t_y that represent offset within one grid cell. The bounding box location center (b_x,b_y) is related as:

b_x = \sigma(t_x) + c_x
b_y = \sigma(t_y) + c_y

The center coordinates are the sum of the sigmoid of the network’s location outputs plus a constant. What is that constant? Just the 0-based index of the grid cell location. In this format, an object centered within a 19x19 grid would have b_x = 9.5, where 0.5 is the grid-cell-relative \sigma(t_x) part and 9 is the 0-based, image-relative x index of the center grid cell (cells run from 0 to 18).

Using the second set of equations is what enables moving back and forth between grid-relative and image-relative location coordinates. The grid-relative fractional part (because 0 < \sigma(t) < 1) is provided as output from the network. The image-relative integer part is implicit in the position of that output within the network output tensor.
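A minimal sketch of that center decode (hypothetical helper names). Note that in a 19x19 grid with 0-based cells 0–18, the cell containing the exact image center has index 9, so t = 0 there decodes to 9.5:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def decode_center(t_x, t_y, c_x, c_y):
    """Grid-relative center: sigmoid offset within the cell plus the cell's
    0-based index, i.e. b_x = sigmoid(t_x) + c_x, b_y = sigmoid(t_y) + c_y."""
    return sigmoid(t_x) + c_x, sigmoid(t_y) + c_y

# t = 0 gives sigmoid(0) = 0.5, the middle of the cell; with cell index 9
# that is b_x = 9.5, the exact center of a 19x19 grid.
b_x, b_y = decode_center(0.0, 0.0, c_x=9, c_y=9)
print(b_x, b_y)  # 9.5 9.5
```

The sigmoid pins the fractional part inside (0, 1), so a predicted center can never wander out of its own cell; only the added index places it in the image.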

And yes, in this particular YOLO implementation these transformations happen in the yolo_head and preprocess_true_boxes helper functions. Hope this helps future readers.
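And this is why the flattened shape fed to NMS is not a problem: once every box has been decoded into a common, image-relative corner format, a flat list carries all the information NMS needs, with no grid in sight. Here is a minimal pure-Python version of the greedy suppression that tf.image.non_max_suppression performs (illustrative boxes; the TF op likewise expects [y1, x1, y2, x2] corners plus per-box scores):

```python
def iou(a, b):
    """Intersection-over-union of two boxes in [y1, x1, y2, x2] corner format."""
    y1, x1 = max(a[0], b[0]), max(a[1], b[1])
    y2, x2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, y2 - y1) * max(0.0, x2 - x1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-max suppression: keep boxes in descending score order,
    dropping any box that overlaps an already-kept box too much."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [[0.10, 0.10, 0.50, 0.50],
         [0.12, 0.11, 0.52, 0.49],   # near-duplicate of box 0
         [0.60, 0.60, 0.90, 0.90]]
print(nms(boxes, [0.9, 0.8, 0.7]))  # [0, 2] -- the duplicate is suppressed
```

Nothing in `iou` or `nms` refers to a grid cell: the conversion done in yolo_head has already folded the cell index into the coordinates.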


Here is a direct quote from the original YOLO paper:

Each grid cell predicts B bounding boxes and confidence
scores for those boxes…Each bounding box consists of 5 predictions: x, y, w, h, and confidence. The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell

Object mid-point is exactly (part of) what YOLO outputs.


Very helpful discussion! So it looks to me like we can attribute whatever meaning we choose to the output values, and the NN will then use data to figure out the link from raw input to that output. Of course, there must be a real information path from the raw data to the output, rather than random guessing.


Hi, Ping,

Welcome to the community.

Yes, we train on the images through a defined perspective, i.e. the YOLO algorithm, where we already have a pre-defined structure, and then we follow the same steps as we go through the process. We need to bear in mind the thresholds we are considering to achieve the desired score: the probability and the IoU.
This article (found through a web search) has all the details about how YOLO is built.

I have one small doubt. Does the YOLO algorithm take the midpoint of the object present in the grid cell, or the midpoint of the grid cell where the object is present? In the video it is said that it takes the midpoint of the object… which is more accurate computationally: taking the midpoint of the grid cell, or the midpoint of the object?


The important location is the center of the object, whether that is a known location (ground truth) or a predicted location (network forward prop output). Most of the time YOLO uses grid-relative coordinates, so (9.5, 9.5) means the exact center of the center grid cell in a 19x19 grid like the one used in the car detection exercise. It is trivial to convert between grid-cell-relative and image-relative coordinates since you know the grid and image dimensions, and there are times when you need to convert: namely, from image-relative to grid-cell-relative to set up training data, and from grid-cell-relative to image-relative to visualize predicted bounding boxes superimposed on the image.
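Both conversions fit in a couple of lines (hypothetical helper names, coordinates normalized to [0, 1) on the image side):

```python
def image_to_grid(x, S):
    """Image-relative coordinate in [0, 1) -> (0-based cell index, offset within cell).
    Used when setting up training labels."""
    cell = int(x * S)
    return cell, x * S - cell

def grid_to_image(cell, offset, S):
    """(0-based cell index, offset within cell) -> image-relative coordinate.
    Used when drawing predicted boxes on the image."""
    return (cell + offset) / S

# The exact image center on a 19x19 grid lands in cell 9, halfway across it:
print(image_to_grid(0.5, 19))     # (9, 0.5)
print(grid_to_image(9, 0.5, 19))  # 0.5
```

The two functions are exact inverses of each other, which is what makes round-tripping between training labels and on-image visualization painless.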
