Hey everyone, I hope you are doing great.
I have a sort of brain lag in understanding how the training set for the YOLO algorithm works. What I got is that we get labelers to label our images, which gives us the concrete Y with class and bounding box. Then I got confused when Prof. Ng introduced the grid cells. Are we training the ConvNet on each cell in the grid, so that if the center of an object falls within a cell, then that cell defines the object? If so, wouldn’t that cell be too small to hold a representation of the object, as if it only contains some noise?
Sorry for the scattered thoughts, and thanks in advance!
I think you can find an answer on this thread:
Thank you for the reply, and sorry for the late response.
@TMosh
What I have understood is that there are two divisions: one is the grid, for which we specify the dimensions, and the other is over the pixels of the image itself, which the ConvNet runs across. For example, if the input is 10x10 and the grid is 5x5, then each 2x2 block of pixels forms a grid cell, which is responsible for object prediction and bounding box specification for that cell only. All of this is done over the whole image by feeding the whole image to the network once, similar to the “Convolutional implementation of sliding windows” part in the notes.
The network then tries to mimic the process of specifying the center, in order to specify which cell is responsible for detecting the object and surrounding it with a bounding box. Am I right, or did I get something wrong?
Thanks again.
Each grid cell predicts B bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object and also how accurate it thinks the box is that it predicts.
Formally we define confidence as Pr(Object) ∗ IOU^{truth}_{pred}.
If no object exists in that cell, the confidence scores should be zero.
Otherwise we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth.
Each bounding box consists of 5 predictions: x, y, w, h, and confidence.
The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell.
The width and height are predicted relative to the whole image.
Finally the confidence prediction represents the IOU between the predicted box and any ground truth box.
Each grid cell also predicts C conditional class probabilities, Pr(Class_i | Object). These probabilities are conditioned on the grid cell containing an object.
We only predict one set of class probabilities per grid cell, regardless of the number of boxes B.
At test time we multiply the conditional class probabilities and the individual box confidence predictions, which gives us class-specific confidence scores for each box. These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object.
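Since the quoted confidence definition is built on IOU, here is a minimal sketch of intersection over union between two boxes (my own illustration, not the course code; I assume an (x_center, y_center, w, h) box format in pixels):

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_center, y_center, w, h)."""
    # Convert center/size format to corner coordinates.
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2

    # Overlap rectangle; zero width/height if the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

# A perfect prediction gives confidence Pr(Object) * 1.0:
print(iou((50, 50, 20, 20), (50, 50, 20, 20)))  # 1.0
print(iou((50, 50, 20, 20), (60, 50, 20, 20)))  # ~0.33
```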
So the network is not trying to mimic the process of specifying the center; rather, each grid cell is responsible for detecting objects whose centers fall within that cell’s boundaries.
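To make “whose centers fall within that cell’s boundaries” concrete, here is a minimal sketch of assigning a labeled object center to its responsible grid cell (again my own illustration; the function name and arguments are made up):

```python
def responsible_cell(x_center, y_center, img_w, img_h, S):
    """Return (row, col) of the grid cell containing the object center."""
    col = int(x_center / img_w * S)  # which of the S columns
    row = int(y_center / img_h * S)  # which of the S rows
    # Clamp in case the center lies exactly on the right/bottom edge.
    return min(row, S - 1), min(col, S - 1)

# 10x10 image with a 5x5 grid: each cell covers a 2x2 block of pixels.
print(responsible_cell(3.0, 7.0, img_w=10, img_h=10, S=5))  # (3, 1)
```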
Basically, YOLO divides the image into an S × S grid and, for each grid cell, predicts B bounding boxes, confidence scores for those boxes, and C class probabilities. These predictions are encoded as an S × S × (B ∗ 5 + C) tensor.
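For example, with the v1 paper’s values of S = 7, B = 2 and C = 20 (PASCAL VOC), the tensor shape works out like this (a quick sketch):

```python
import numpy as np

S, B, C = 7, 2, 20               # grid size, boxes per cell, classes (YOLO v1 paper values)
Y = np.zeros((S, S, B * 5 + C))  # 5 = x, y, w, h, confidence per box
print(Y.shape)                   # (7, 7, 30), i.e. 7 * 7 * 30 = 1470 numbers
```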
With each convolutional layer’s reduction in spatial dimensions there is an increase in depth, the number of filters being equal to the number of output channels.
I recommend always including an explicit mention of the YOLO version when trying to describe YOLO design. This quote is directly from the original 2015/early 2016 paper. But it had changed by late 2016 for the YOLO9000: Better, Faster, Stronger paper, which is what the code in this course exercise is based upon. Also known as v2; see the caption for Figure 3, which states: “…we predict the width and height of the box as offsets from cluster centroids…” Cluster centroids in Redmon-speak are anchor boxes in the vocabulary of this class. So in v2, width and height predictions are quite tied to the anchor box shapes and not whole-image-relative. The actual expressions for v2 are:
b_w = p_w e^{t_w}
b_h = p_h e^{t_h}
where the t_i are two of the network’s predicted outputs, the p_i are the anchor box dimensions, e is the exponential function, and the b_i are the bounding box shape predictions.
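A minimal sketch of that v2 shape decode (the function and variable names are mine, not from the paper or the course code):

```python
import math

def decode_shape(t_w, t_h, p_w, p_h):
    """Raw network outputs (t_w, t_h) -> bounding box shape (b_w, b_h) in pixels,
    given anchor box dimensions (p_w, p_h)."""
    b_w = p_w * math.exp(t_w)
    b_h = p_h * math.exp(t_h)
    return b_w, b_h

# t = 0 reproduces the anchor box shape exactly:
print(decode_shape(0.0, 0.0, p_w=16, p_h=16))   # (16.0, 16.0)
print(decode_shape(0.3, -0.2, p_w=16, p_h=16))  # roughly (21.6, 13.1)
```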
@MustafaaShebl seems to be thinking on the right track. I would describe it as: the YOLO CNN, like any supervised learning trained neural net, tries to reproduce its ground truth values, which explicitly include object center location, shape, class, and confidence, and implicitly include grid cell. There are existing threads in this forum that go into detail about setting up the training data, assigning grid cells, and computing Y - \hat{Y} during training. You can discover these with search. hth
Thank you so much really appreciate your responses. @Deepti_Prasad @ai_curious
What I have got is that, like any supervised learning, the network does the job of the labelers, and the grid part is only there to allow the network to identify multiple objects in the image. We don’t take it one cell at a time; we apply this similarly to the convolutional implementation of the sliding windows algorithm. I hope I got it right this time.
Wanted to add a little to my attempt above to clarify whether the predicted bounding box shape is grid- or image-relative in YOLO v2.
The units of measure of the predicted bounding box shape, b_w and b_h, are pixels. Because they are a shape and not a location, they really are neither grid- nor image-relative. That is, a 16x16 square predicted bounding box shape is independent of both grid and image dimensions. There is no offset.
Also important to recognize that neither b_w nor b_h is a direct output of the neural network. Rather, the network outputs t_w and t_h. Suppose a ground truth bounding box is exactly the shape of one of the anchor boxes. Then, using the expressions above,
\frac{b_w}{p_w} = e^{t_w} = 1, so
t_w = \log(1) = 0
That means for a ground truth bounding box exactly the same shape as an anchor box, the network is trained to predict 0 for t_w (and t_h), resulting in a predicted bounding box width (and height) exactly the same as p_w (and p_h) of the anchor box. Further, since 0 < e^{t_i} < \infty, the expressions suggest that the same range applies to the predicted bounding box shape. That means the predicted bounding box could approach zero pixels at the low end and infinity at the upper. With good anchor box shape choices and adequate training, the network learns to predict the t_i as small values above and below but near zero; one would hope predictions do not in fact reach extreme outlier values, but in theory they could.
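To see the inverse direction that sets up the training targets, note that inverting b_w = p_w e^{t_w} gives t_w = \ln(b_w / p_w). A quick sketch (names mine):

```python
import math

def encode_shape(b_w, b_h, p_w, p_h):
    """Ground truth box shape -> training targets (t_w, t_h) for one anchor box."""
    return math.log(b_w / p_w), math.log(b_h / p_h)

# Ground truth exactly matching the anchor -> targets of zero:
print(encode_shape(16, 16, p_w=16, p_h=16))  # (0.0, 0.0)
# A box twice as wide as the anchor -> t_w = ln(2) ~ 0.693:
print(encode_shape(32, 16, p_w=16, p_h=16))  # (~0.693, 0.0)
```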
Does this mean that we are learning t_w and t_h, which take the values of the ln function over ]0, ∞[?
and thank you so much
Yes.
Similar for the center location values t_x and t_y, except here the expressions are:
b_x = \sigma(t_x) + c_x , and
b_y = \sigma(t_y) + c_y
Where c_i is the grid cell index derived from the location in the Y or \hat{Y} matrix, t_i are the location values output by the network for an object in the c_i cell of the matrix, and b_i is the derived bounding box center prediction.
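A matching sketch for the center decode (mine; the sigmoid is written out explicitly):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def decode_center(t_x, t_y, c_x, c_y):
    """Raw outputs (t_x, t_y) -> box center (b_x, b_y) in grid cell units,
    where (c_x, c_y) is the index of the responsible grid cell."""
    return sigmoid(t_x) + c_x, sigmoid(t_y) + c_y

# Because 0 < sigmoid < 1, the predicted center can never leave its cell:
print(decode_center(0.0, 0.0, c_x=3, c_y=1))   # (3.5, 1.5): the cell's center
print(decode_center(-9.0, 9.0, c_x=3, c_y=1))  # ~(3.0, 2.0): near the cell edges
```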
Here the network predicts t_x and t_y after training over b_x, b_y, c_x and c_y?
The equations are a bit confusing, not gonna lie, since as I understand it we are adding the index of the cell to the sigmoid of the value predicted by the network. But why can’t we just predict normal b_x, b_y, b_w and b_h directly? Or did these equations turn out to give better performance?
I don’t think anyone who has studied this algorithm would disagree with you on that.
The network outputs two numbers that are used for center location and two that are used for shape (for each grid cell + anchor box tuple). But the numbers output by the network aren’t used directly in either case. Instead, they are transformed by these equations into 4 other numbers, and it is this second set that is used in the cost function. The transformations are applied to the network outputs in order to force them to be of the right scale or magnitude to all play nice in the cost function together. To the best of my understanding, this approach was arrived at empirically; experiments showed the network learned better and faster this way than predicting the bounding box coordinates directly.
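Putting the thread’s equations together, here is a compact sketch of the full per-box transform from the four raw network outputs to the values that enter the cost function (all names are mine):

```python
import math

def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """All four raw outputs for one (grid cell, anchor box) pair -> (b_x, b_y, b_w, b_h)."""
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    b_x = sigmoid(t_x) + c_x   # center is pinned inside the responsible cell
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * math.exp(t_w)  # shape rescales the anchor box, always positive
    b_h = p_h * math.exp(t_h)
    return b_x, b_y, b_w, b_h

print(decode_box(0, 0, 0, 0, c_x=3, c_y=1, p_w=16, p_h=16))  # (3.5, 1.5, 16.0, 16.0)
```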
I get it now, thank you so much I really appreciate your effort.