How to prepare bounding box labels

In the bounding box lecture, the image was divided into 9 grid cells for illustrative purposes. It was also mentioned that in practice it could be a 19*19 grid. So I am just wondering how people prepare the labels for these grid cells.

For example, suppose that in one image 30% of the grid cells contain objects that you want to detect. That would be 19*19*30% ≈ 108 grid cells. Imagine you have just 10,000 images. There would be 1,080,000 grid cells you have to label with bx, by, bh, bw. How could that be done?

Initially the labels don’t depend on grids. You just define a bounding box and class for a location on the image. Later that can be easily converted to the correct ground truth data structure by algorithmically mapping image-relative coordinates to grid- and anchor box-relative values. That’s the good news. The bad news is you still have to have labels for the objects in the original 10,000 images. Which is one reason so many projects start with a model pre-trained on a well curated data set.

Hello, thanks for your reply. I think I may not have good intuition about this. Let’s say for example the output layer is 19*19*8 as discussed in the lecture video. This means the neural network examined 19*19 grid cells (or sliding windows) of the original image. And for each grid cell in the output layer, depth-wise, it has 8 outputs: pc, bx, by, bh, bw, C1, C2, C3. So for whichever grid cell contains objects (or, most often, parts of objects) that we want to detect, pc will be 1 and we will need corresponding labels for the other 7 variables from bx to C3 as well?


In the training dataset, the coordinates of the bounding box and the class of the object are already given. So when you divide that image into a 19*19 grid, you can compute the label values of any grid cell from the labels and coordinates given in the training set!

There are 3 related but distinct steps.

First is labeled training data. These are a set of images and a text file that is often XML or JSON. The text file has a ‘record’ for each image that contains one category or class designation (car, traffic sign, driving lane, table etc) plus a location for each object in the image. Location is generally two corners of the bounding box. This data is not formatted for direct use in any kind of application, let alone a 19x19 Conv Net.

Second is this ground truth data reformatted into the structure that the consuming application expects. In your case, that is the CNN output format. This can be done 100% algorithmically, for example converting corners to center and shape and converting the label category to an integer for later use in the one-hot class vector. Generally only the location of the ground truth object center receives data, including the p_c = 1. All other locations are left blank (0).
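To make this second step concrete, here is a minimal sketch for the simple 19*19*8 layout from the lecture (no anchor boxes). The class list, the image size, and the choice to express b_h and b_w in units of grid cells are assumptions made for illustration, as is the function name `encode_labels`.

```python
import numpy as np

GRID = 19                                       # assumed grid size, as in the lecture example
CLASSES = ["pedestrian", "car", "motorcycle"]   # hypothetical class list (C1..C3)


def encode_labels(boxes, img_w, img_h):
    """Build a (GRID, GRID, 8) ground-truth array from corner-format labels.

    boxes: list of (category, x1, y1, x2, y2) in image pixels.
    Channel layout per cell: [p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3].
    """
    y = np.zeros((GRID, GRID, 5 + len(CLASSES)), dtype=np.float32)
    for category, x1, y1, x2, y2 in boxes:
        # Corners -> center and shape, normalised to the image size.
        cx = (x1 + x2) / 2.0 / img_w
        cy = (y1 + y2) / 2.0 / img_h
        bw = (x2 - x1) / img_w
        bh = (y2 - y1) / img_h
        # Only the cell that contains the center receives data.
        col = min(int(cx * GRID), GRID - 1)
        row = min(int(cy * GRID), GRID - 1)
        y[row, col, 0] = 1.0                              # p_c
        y[row, col, 1] = cx * GRID - col                  # b_x, relative to the cell
        y[row, col, 2] = cy * GRID - row                  # b_y, relative to the cell
        y[row, col, 3] = bh * GRID                        # b_h in units of cells
        y[row, col, 4] = bw * GRID                        # b_w in units of cells
        y[row, col, 5 + CLASSES.index(category)] = 1.0    # one-hot class
    return y


labels = [("car", 1000.0, 280.0, 1040.0, 330.0)]
y = encode_labels(labels, img_w=1280, img_h=720)
print(y.shape, int(y[..., 0].sum()))
```

One labelled box per object is all a human has to supply; every other cell of the array stays at zero without anyone labelling it.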

Finally there is the output of forward propagation. Every location in the output structure contains predicted values. If the network has been trained well and generalizes well, there will be lots of low values for p_c and accurate location and class value predictions where predicted p_c is high. For outputs with low p_c there will still be predictions for location and class, but you won’t care what they are. It is comparison of these predicted outputs that have the higher p_c against ground truth that is performed during training to drive learning or evaluate accuracy.
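As a rough illustration of "every location contains predicted values, but you only care about the ones with high p_c", here is a small sketch. The function name `confident_cells`, the 0.5 threshold, and the random stand-in for a network output are all assumptions for demonstration only.

```python
import numpy as np


def confident_cells(pred, threshold=0.5):
    """Return (row, col, values) for cells whose predicted p_c exceeds threshold.

    pred: array of shape (grid, grid, 8) with p_c in channel 0.
    The default threshold of 0.5 is an arbitrary illustrative choice.
    """
    rows, cols = np.where(pred[..., 0] > threshold)
    return [(int(r), int(c), pred[r, c]) for r, c in zip(rows, cols)]


rng = np.random.default_rng(0)
# Stand-in for a forward-propagation output: every cell holds *some* prediction.
pred = rng.uniform(0.0, 0.2, size=(19, 19, 8))
pred[8, 15, 0] = 0.93          # one cell is confident about an object
hits = confident_cells(pred)
print([(r, c) for r, c, _ in hits])
```

Only the single confident cell survives the filter; the location and class values in the other 360 cells are still there but are ignored.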

Hope this helps.

Here is an example of a labelled training data file:

{ "images": [
    {
        "name": "b1c66a42-6f7d68ca.jpg",
        "attributes": {
            "weather": "overcast",
            "scene": "city street",
            "timeofday": "daytime"
        },
        "timestamp": 10000,
        "labels": [
            {
                "category": "traffic sign",
                "attributes": {
                    "occluded": false,
                    "truncated": false,
                    "trafficLightColor": "none"
                },
                "manualShape": true,
                "manualAttributes": true,
                "box2d": {
                    "x1": 1000.698742,
                    "y1": 281.992415,
                    "x2": 1040.626872,
                    "y2": 326.91156
                },
                "id": 0
            }
        ]
    }
] }

Here is an example of converting the ‘boxes’ part to training data for a YOLO CNN:

        # Fragment from inside a loop over images.  Assumes `import numpy as np`;
        # the helpers, Y, anchors and ground_truth_boxes are defined elsewhere.
        for box in training_image['boxes']:
            ground_truth_boxes += 1
            x1 = int(box['x1'])
            y1 = int(box['y1'])
            x2 = int(box['x2'])
            y2 = int(box['y2'])
            raw_box = [x1, y1, x2, y2]
            print(f'GT data: {x1},{y1},{x2},{y2}')

            # convert to x,y w,h.  BDD JSON ground truth data is in image-relative pixels
            bx, by, bw, bh = convert_corners_to_YOLO_format(x1, y1, x2, y2)
            # find grid cell for center (x,y) and adjust center coords
            cx, cy = get_grid_cell(bx, by)

            tx = logit(bx - cx)  # remove grid cell index offset, then take the inverse sigmoid
            ty = logit(by - cy)  # the offset (by - cy) lies in [0, 1); logit maps it to t_y
            # find best anchor
            best_anchor = get_best_anchor(bw, bh, anchors)
            tw = np.log(bw / anchors[best_anchor][0])  # t_w == log of w ratio
            th = np.log(bh / anchors[best_anchor][1])  # t_h == log of h ratio
            # write training data entry into (m, GRID_W, GRID_H, num_anchors, (1+4+1))
            Y[image,cx,cy,best_anchor,0] = 25.0 # object present: sigmoid(25.) ~ 1.  Only used in testing when GT == predicted
            Y[image,cx,cy,best_anchor,1] = tx # inverse sigmoid of grid-cell relative x offset
            Y[image,cx,cy,best_anchor,2] = ty # inverse sigmoid of grid-cell relative y offset
            Y[image,cx,cy,best_anchor,3] = tw # inverse exp of ratio of GT w to width of best anchor
            Y[image,cx,cy,best_anchor,4] = th # inverse exp of ratio of GT h to height of best anchor
            Y[image,cx,cy,best_anchor,5] = 1. # FIXME cars only for now.  need class lookup + one_hot when there are more

NOTE: my YOLO investigation only has a single class for cars, which is why the class is hardcoded to 1. Otherwise it would need a dictionary lookup on the label category. Also, since I am only looking at cars (for now) I reformatted the JSON to drop everything except the bounding boxes, which I labelled ‘boxes’. That is why the original file says ‘box2d’ and my code says ‘boxes’.
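For anyone who wants to do the same reduction, here is a rough sketch of stripping a label file down to just the boxes, with the dictionary lookup mentioned above. It is not the script I actually used: the `CLASS_IDS` table, the `simplify` function, the `class_id` key, and the top-level "images" list layout are assumptions for illustration, and the second label entry is made up to show the filtering.

```python
import json

# Hypothetical category-to-index table; the thread's own code only handles cars.
CLASS_IDS = {"car": 0, "traffic sign": 1}

# Inline stand-in for the label file; the second label is invented for the demo.
raw = json.loads("""
{"images": [
  {"name": "b1c66a42-6f7d68ca.jpg",
   "labels": [
     {"category": "traffic sign",
      "box2d": {"x1": 1000.698742, "y1": 281.992415,
                "x2": 1040.626872, "y2": 326.91156}},
     {"category": "driving lane"}
   ]}
]}
""")


def simplify(records, class_ids):
    """Keep only the image name and the boxes of the wanted categories."""
    out = []
    for rec in records:
        boxes = []
        for label in rec.get("labels", []):
            # Skip categories we do not train on and labels without a 2D box.
            if label.get("category") in class_ids and "box2d" in label:
                box = dict(label["box2d"])
                box["class_id"] = class_ids[label["category"]]
                boxes.append(box)
        out.append({"name": rec["name"], "boxes": boxes})
    return out


simplified = simplify(raw["images"], CLASS_IDS)
print(simplified[0]["boxes"])
```

The integer `class_id` is what would later index the one-hot class vector once there is more than one class.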

The conversion from corners to center and shape is pretty straightforward. Compute the width and height, divide each by 2, and offset the center by half the width and half the height from x1,y1. Convert the width and height according to the YOLO formula using the ‘best’ anchor box, or ‘prior’, as the YOLO team calls it. The use of logit/sigmoid/exp in these relationships is per the equations provided in the v2 and v3 YOLO papers. They are explained in detail in other posts in this forum.
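Here is a minimal sketch of what the helper functions called in the snippet could look like. The function names match the calls above, but the bodies are one possible version rather than the exact code behind the snippet. The image size, grid size, anchor values, the choice of grid-cell units, and picking the best anchor by IoU of co-centered boxes are all assumptions.

```python
import numpy as np

IMG_W, IMG_H = 1280, 720      # assumed image size in pixels
GRID_W, GRID_H = 19, 19       # assumed grid size
# Hypothetical anchors as (width, height) in grid-cell units.
anchors = np.array([[1.0, 1.0], [3.0, 2.0], [6.0, 4.0]])


def convert_corners_to_YOLO_format(x1, y1, x2, y2):
    """Corners in pixels -> center and shape in grid-cell units."""
    bw = (x2 - x1) * GRID_W / IMG_W
    bh = (y2 - y1) * GRID_H / IMG_H
    bx = x1 * GRID_W / IMG_W + bw / 2.0   # center = corner + half the width
    by = y1 * GRID_H / IMG_H + bh / 2.0   # center = corner + half the height
    return bx, by, bw, bh


def get_grid_cell(bx, by):
    """Integer index of the cell that contains the center."""
    return int(bx), int(by)


def logit(p, eps=1e-6):
    """Inverse sigmoid, clipped so that inputs of 0 and 1 stay finite."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(np.log(p / (1.0 - p)))


def get_best_anchor(bw, bh, anchors):
    """Index of the anchor with the highest IoU against a (bw, bh) box,
    treating both boxes as centered on the same point."""
    inter = np.minimum(anchors[:, 0], bw) * np.minimum(anchors[:, 1], bh)
    union = anchors[:, 0] * anchors[:, 1] + bw * bh - inter
    return int(np.argmax(inter / union))


bx, by, bw, bh = convert_corners_to_YOLO_format(1000, 281, 1040, 326)
cx, cy = get_grid_cell(bx, by)
tx, ty = logit(bx - cx), logit(by - cy)
best = get_best_anchor(bw, bh, anchors)
tw = np.log(bw / anchors[best][0])
th = np.log(bh / anchors[best][1])
print(cx, cy, best, tx, ty, tw, th)
```

Applying sigmoid to tx and ty and exp to tw and th (times the anchor shape) recovers the original center offset and box size, which is the relationship the v2 and v3 papers use at prediction time.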