How to prepare bounding box labels

There are three related but distinct steps.

First is the labeled training data. This is a set of images plus a text file, often XML or JSON. The text file has a ‘record’ for each image that lists, for each object in the image, one category or class designation (car, traffic sign, driving lane, table, etc.) plus a location, generally given as two corners of the bounding box. This data is not formatted for direct use in any kind of application, let alone a 19x19 ConvNet.
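For concreteness, here is a minimal sketch of what one such record might look like. The field names and values are hypothetical; real formats such as Pascal VOC XML or COCO JSON differ in detail:

```python
# A hypothetical JSON-style annotation record for one image.
# Corners are (x_min, y_min) and (x_max, y_max) in pixels.
record = {
    "image": "frame_0042.jpg",
    "objects": [
        {"class": "car",          "box": [112, 80, 298, 210]},
        {"class": "traffic sign", "box": [305, 40, 340, 92]},
    ],
}
```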

Second is this ground truth data reformatted into the structure the consuming application expects. In your case, that is the CNN output format. This can be done 100% algorithmically, for example converting the corners to a center and shape, and converting the label category to an integer for later use in the one-hot class vector. Generally only the grid location containing the ground truth object’s center receives data, including p_c = 1. All other locations are left as zeros.
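Here is a minimal sketch of that conversion, assuming a 19x19 grid, one box per cell, three classes, and no anchor boxes (all simplifications; a real YOLO-style target also has an anchor dimension):

```python
import numpy as np

GRID = 19
CLASSES = {"car": 0, "traffic sign": 1, "driving lane": 2}

def record_to_target(record, img_w, img_h):
    # Target layout per cell: [p_c, b_x, b_y, b_w, b_h, c_0, c_1, c_2].
    target = np.zeros((GRID, GRID, 5 + len(CLASSES)))
    for obj in record["objects"]:
        x_min, y_min, x_max, y_max = obj["box"]
        # Corners -> center and shape, normalized to [0, 1].
        cx = (x_min + x_max) / 2 / img_w
        cy = (y_min + y_max) / 2 / img_h
        w = (x_max - x_min) / img_w
        h = (y_max - y_min) / img_h
        # Only the cell containing the object's center receives data.
        col = min(int(cx * GRID), GRID - 1)
        row = min(int(cy * GRID), GRID - 1)
        target[row, col, 0] = 1.0                          # p_c = 1
        target[row, col, 1:5] = [cx, cy, w, h]             # location
        target[row, col, 5 + CLASSES[obj["class"]]] = 1.0  # one-hot class
    return target
```

Every cell not touched by the loop keeps its zeros, which is exactly the “left blank” behavior described above.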

Finally, there is the output of forward propagation. Every location in the output structure contains predicted values. If the network has been trained well and generalizes well, there will be many locations with low p_c, and accurate location and class predictions where the predicted p_c is high. Outputs with low p_c still contain predictions for location and class, but you won’t care what they are. It is the comparison of these predicted outputs, particularly those with high p_c, against the ground truth that drives learning during training and measures accuracy during evaluation.
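A minimal sketch of reading that raw output: keep only the cells where the predicted p_c clears a confidence threshold (the 0.6 value is arbitrary, and in practice this step is followed by non-max suppression):

```python
import numpy as np

def confident_cells(pred, threshold=0.6):
    # pred has the same (GRID, GRID, 5 + num_classes) layout as the target.
    detections = []
    for row in range(pred.shape[0]):
        for col in range(pred.shape[1]):
            p_c = pred[row, col, 0]
            if p_c < threshold:
                continue  # low p_c: location/class values exist but are ignored
            cx, cy, w, h = pred[row, col, 1:5]
            cls = int(np.argmax(pred[row, col, 5:]))
            detections.append((p_c, (cx, cy, w, h), cls))
    return detections
```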

Hope this helps.