Here is an example of a labelled training data file:
{
  "images": [
    {
      "name": "b1c66a42-6f7d68ca.jpg",
      "attributes": {
        "weather": "overcast",
        "scene": "city street",
        "timeofday": "daytime"
      },
      "timestamp": 10000,
      "labels": [
        {
          "category": "traffic sign",
          "attributes": {
            "occluded": false,
            "truncated": false,
            "trafficLightColor": "none"
          },
          "manualShape": true,
          "manualAttributes": true,
          "box2d": {
            "x1": 1000.698742,
            "y1": 281.992415,
            "x2": 1040.626872,
            "y2": 326.91156
          },
          "id": 0
        }
      ]
    }
  ]
}
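For reference, here is a minimal sketch of pulling the box2d corners back out of that structure with Python's json module. The inline sample below is truncated to just the fields the training code needs; in practice you would json.load the full label file (filename is your own):

```python
import json

# Truncated version of the label file above, inlined for illustration
sample = '''{ "images": [ { "name": "b1c66a42-6f7d68ca.jpg",
  "labels": [ { "category": "traffic sign",
    "box2d": { "x1": 1000.698742, "y1": 281.992415,
               "x2": 1040.626872, "y2": 326.91156 } } ] } ] }'''

data = json.loads(sample)
for image in data['images']:
    for label in image['labels']:
        b = label['box2d']
        print(label['category'], b['x1'], b['y1'], b['x2'], b['y2'])
```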
Here is an example of converting the ‘boxes’ part to training data for a YOLO CNN:
for box in training_image['boxes']:
    ground_truth_boxes += 1
    x1 = int(box['x1'])
    y1 = int(box['y1'])
    x2 = int(box['x2'])
    y2 = int(box['y2'])
    raw_box = [x1, y1, x2, y2]
    print('GT data: ' + str(x1) + ',' + str(y1) + ',' + str(x2) + ',' + str(y2))
    # convert corners to center x,y and w,h. BDD JSON ground truth is in image-relative pixels
    bx, by, bw, bh = convert_corners_to_YOLO_format(x1, y1, x2, y2)
    # find the grid cell containing the center (x,y)
    cx, cy = get_grid_cell(bx, by)
    tx = logit(bx - cx)  # inverse sigmoid of the cell-relative x offset (bx - cx is in [0, 1])
    ty = logit(by - cy)  # inverse sigmoid of the cell-relative y offset
    # find the best-matching anchor
    best_anchor = get_best_anchor(bw, bh, anchors)
    tw = np.log(bw / anchors[best_anchor][0])  # t_w == log of the ratio of GT w to anchor w
    th = np.log(bh / anchors[best_anchor][1])  # t_h == log of the ratio of GT h to anchor h
    # write training data entry into Y of shape (m, GRID_W, GRID_H, N_ANCHORS, 1+4+1)
    Y[image, cx, cy, best_anchor, 0] = 25.0  # object present: sigmoid(25.) ~= 1. Only used in testing when GT == predicted
    Y[image, cx, cy, best_anchor, 1] = tx  # inverse sigmoid of grid-cell relative x offset
    Y[image, cx, cy, best_anchor, 2] = ty  # inverse sigmoid of grid-cell relative y offset
    Y[image, cx, cy, best_anchor, 3] = tw  # inverse exp of ratio of GT w to width of best anchor
    Y[image, cx, cy, best_anchor, 4] = th  # inverse exp of ratio of GT h to height of best anchor
    Y[image, cx, cy, best_anchor, 5] = 1.  # FIXME cars only for now. need class lookup + one_hot when there are more
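The helpers get_grid_cell and get_best_anchor are not shown above. Here is one plausible sketch, assuming the center and shape are already in grid-cell units and that anchors is a list of (w, h) pairs in the same units; the best anchor is picked by shape-only IoU (both boxes imagined centered at the origin), which is how the YOLOv2 authors derived their priors:

```python
def get_grid_cell(bx, by):
    # Grid cell indices for a center given in grid-cell units.
    # Plain int() truncation is fine since coordinates are non-negative.
    return int(bx), int(by)

def get_best_anchor(bw, bh, anchors):
    # Pick the anchor with the highest IoU against the GT shape,
    # comparing widths/heights only (both boxes centered at the origin).
    best, best_iou = 0, 0.0
    for i, (aw, ah) in enumerate(anchors):
        inter = min(bw, aw) * min(bh, ah)
        union = bw * bh + aw * ah - inter
        iou = inter / union
        if iou > best_iou:
            best, best_iou = i, iou
    return best
```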
NOTE: my YOLO investigation currently has a single class (cars), which is why the class is hardcoded to 1. With more classes it would need a dictionary lookup on the label category plus a one-hot encoding. Also, since I am only looking at cars (for now), I reformatted the JSON to drop everything except the bounding boxes, which I labelled 'boxes'. That is why the original file says 'box2d' while my code says 'boxes'.
The conversion from corners to center-and-shape is straightforward: compute the width and height, then offset x1,y1 by half the width and half the height to get the center. The width and height are then converted according to the YOLO formula using the 'best' anchor box, or prior, as the YOLO team calls them. The use of logit/sigmoid/exp in these relationships follows the equations in the YOLOv2 and YOLOv3 papers; they are explained in detail in other posts in this forum.
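For completeness, here is a sketch of convert_corners_to_YOLO_format and logit. The 13x13 grid and 1280x720 BDD frame size are my assumptions, not values from the original code; the key point is that corners in pixels come out as center and shape in grid-cell units:

```python
import numpy as np

IMG_W, IMG_H = 1280, 720   # BDD100K frame size (assumed)
GRID_W, GRID_H = 13, 13    # output grid size (assumed)

def convert_corners_to_YOLO_format(x1, y1, x2, y2):
    # Corner pixels -> center (bx, by) and shape (bw, bh) in grid-cell units.
    w = x2 - x1
    h = y2 - y1
    bx = (x1 + w / 2.0) * GRID_W / IMG_W
    by = (y1 + h / 2.0) * GRID_H / IMG_H
    bw = w * GRID_W / IMG_W
    bh = h * GRID_H / IMG_H
    return bx, by, bw, bh

def logit(p):
    # Inverse sigmoid: logit(sigmoid(x)) == x.
    return np.log(p / (1.0 - p))
```

Note that logit diverges at 0 and 1, so a center sitting exactly on a cell boundary needs an epsilon clamp in practice.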