I am trying to prepare a dataset for object detection as described in the Convolutional Neural Networks course by Andrew Ng. As I understand it, in YOLO the image is divided into grids, with each grid cell having its own target vector. But the datasets I see online list the same image several times with different boxes and classes, as if they were different datapoints.

How can I arrange these vectors into their respective grids?

Doesn’t the model get ‘confused’ during training if the same image appears with multiple outputs in different datapoints, instead of appearing as one datapoint with all the outputs?

The image isn’t actually divided. The entire image is input, once, into the neural net, which then makes a large number of predictions against it.

It is the training data labels, not the image, that get configured to match the grid cell and anchor box (and number of classes) specifications.

- X, the input, is a single image - the entire thing.
- Y, the training data labels, is a multidimensional array with shape determined by the grid cell and anchor box counts.
- \hat{Y}, the network output (the estimated values, or predictions), has the same shape as Y.

The work of setting up the training data for YOLO is mapping the object bounding boxes to the appropriate location in Y.

Try searching the forum on something like YOLO training data to find other related discussions. Here is one such…

Thanks for the quick response and for pointing me in the right direction. So, what I understand is: the input is just an image, as in image classification, but the output of the model has more labels to predict, and these labels are arranged in a grid, say 3x3, which gets flattened, so the final output in this case would be 9 vectors. Is this correct?

Also, let’s imagine there are 2 objects of interest. Will the output consist of only the corresponding 2 vectors or all 9 vectors including those that are not of interest?

If you were working with a model architecture matched to one of the lecture videos, where a 3x3 grid is presented, then yes, the output would be 9 sets of predictions. In YOLO, there are also multiple anchor boxes per grid cell, so there are S*S*B sets of predictions, where S*S is the grid cell count and B is the number of anchor boxes. Each prediction set for YOLO has 4 values for the bounding box (2 for center location and 2 for shape), 1 value for the probability that an object is detected, and C values, one per class. In the autonomous driving programming exercise, the output shape is (19, 19, 425), i.e. (19 * 19) * 5 * (1 + 4 + 80) = 153,425 floating point values.
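That shape arithmetic can be spot-checked in a couple of lines (using the grid, anchor, and class counts from the autonomous driving exercise):

```python
# YOLO output-size arithmetic for the autonomous driving exercise:
# 19x19 grid cells, 5 anchor boxes, 80 classes.
S, B, C = 19, 5, 80
values_per_prediction = 1 + 4 + C   # objectness + box (x, y, w, h) + class scores
total = S * S * B * values_per_prediction
print((S, S, B * values_per_prediction), total)  # (19, 19, 425) 153425
```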

The output can either be flattened or just sliced apart; the important part is that specific values, i.e. a predicted center location or a class, are in specific places of the output. The network makes all 153,425 (for example) predictions every forward pass. That is, \hat{Y} is fully populated. The training labels Y are also fully populated, but every grid cell + anchor box location without an object gets 0. If the training has gone well, \hat{Y} has low values in those locations, and values that exceed the threshold only in locations where there was a training object. If there was a single object labeled in the training input, there should be one location (grid cell + anchor box) with meaningful values in the output. If there was more than one object labeled in the training input, then there should be that many such locations in the output.

From this you can infer that YOLO can detect a maximum of S*S*B unique objects per forward pass. You can increase the maximum by increasing S and B, but at the cost of a larger network output, a bigger memory footprint, more computation, and a slower frame rate. Cheers

I have an input with only the labels of the objects of interest. I could create vectors of zeros and insert the actual label vectors among them so that Y matches the required shape, but how do we pick the ‘correct location’ for the actual vectors among the non-relevant ones?

You can simply start with a Y of the correct shape, initialized to all zeros. Then overwrite just the locations that have training labels.

It is straightforward to determine which grid cell contains the center of the object. First, determine how many pixels map to each grid cell. Then, from your bounding box label, determine the pixel coordinates of the object center. From those two numbers, determine how many grid cell units the object is offset from the origin. In the autonomous driving exercise the images are 608x608 and there are 19x19 grid cells, so each grid cell is ‘assigned’ a patch of 32x32 pixels. An object at the exact center of that image (pixel 304, 304) would be assigned to the S_x = 9, S_y = 9 grid location (0-indexed, since floor(304 / 32) = 9). An object with its center less than 32 pixels from the upper left hand corner would get S_x = 0, S_y = 0.
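A minimal sketch of that mapping, assuming integer pixel coordinates and the 608x608 / 19x19 setup from the exercise (the helper name is my own):

```python
def grid_cell_for_center(cx, cy, image_size=608, grid_size=19):
    """Map a pixel-space object center (cx, cy) to its 0-indexed (S_x, S_y) grid cell."""
    pixels_per_cell = image_size // grid_size  # 32 in the 608x608 / 19x19 case
    # min() clamps a center lying exactly on the far edge into the last cell
    sx = min(cx // pixels_per_cell, grid_size - 1)
    sy = min(cy // pixels_per_cell, grid_size - 1)
    return sx, sy

print(grid_cell_for_center(304, 304))  # (9, 9): exact center of the image
print(grid_cell_for_center(10, 20))    # (0, 0): within 32 px of the top-left corner
```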

Anchor box assignment is more complicated. First, you determine optimal anchor box shapes for the training set using unsupervised learning (typically clustering the training box shapes). Then, you assign each training object to its best anchor based on IOU. Other threads in this forum cover it in more detail.

You end up with a Y matrix of shape (S, S, B, 1 + 4 + C) that is fully populated: all zeros except at the positions (S_x, S_y, B, …) obtained from your labelled data. Note that since the loss function is essentially computing Y - \hat{Y}, those matrices need to be the same shape.
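Putting the pieces together, the label tensor for a single image can be assembled like this (the grid cell, anchor index, class, and box values below are hypothetical, standing in for what you would compute from one labelled object):

```python
import numpy as np

S, B, C = 19, 5, 80
Y = np.zeros((S, S, B, 1 + 4 + C))  # start all-zero, then overwrite labelled slots

# Hypothetical labelled object: grid cell (S_x=9, S_y=9), best anchor 2, class 7,
# with its box already converted to YOLO (x, y, w, h) coordinates.
sx, sy, b, cls = 9, 9, 2, 7
Y[sy, sx, b, 0] = 1.0                      # objectness: an object is present here
Y[sy, sx, b, 1:5] = [0.5, 0.5, 2.8, 3.2]   # box center offsets and shape
Y[sy, sx, b, 5 + cls] = 1.0                # one-hot class label

# Every other grid cell + anchor slot stays zero.
print(Y.shape, np.count_nonzero(Y))  # (19, 19, 5, 85) 6
```

Repeat the overwrite step once per labelled object in the image; an image with two objects fills two slots and leaves the rest zero.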