# Lessons learned training YOLO from scratch on custom images

My YOLO v2 model runs, but it doesn't learn very well. I spent time studying edge conditions in the data to see where noise might be creeping in. Here are a few things I discovered.

1. Ground truth for small objects causes issues
Below is part of an image from the BDD training data I’ve mentioned in other posts. The first image shows the ground truth in blue. The second shows the same image with 32x32-pixel grid cells overlaid. Notice that two of the grid cells have more than one object centered in them. By itself, that is not a problem for YOLO. However, if you follow the analysis below the images, you can see that each pair consists of small objects that map to the same best anchor box shape (based on IOU). Therefore, only one of them can actually appear in the ground truth the model is trained on. You can pick first-in-wins, or last-in-wins, or bump one object to the ‘next best’ anchor. Next best might be OK for some data sets, but for my images these are among the smallest bounding boxes, and the next best anchor is not a good fit at all. All three options are bad and introduce noise into the location accuracy calculations.

```python
# first small bounding box
x1 = 132
y1 = 361
x2 = 144
y2 = 374

cx, cy, width, height = corners_to_grid_cells(x1, y1, x2, y2)
best_anchors = get_best_anchors(width, height, use_anchors)

print(f'detector location for first object: ({cx},{cy},{best_anchors[0][0]})')
print(f'first object shape: [{width * 32.} {height * 32.}]')

print(f'best anchor shape: {anchors[best_anchors[0][0]]}')
print(f'best anchor IOU: {best_anchors[0][1]}')

print(f'second best anchor shape: {anchors[best_anchors[1][0]]}')
print(f'second best anchor IOU: {best_anchors[1][1]}')

# second small bounding box
x1 = 141
y1 = 358
x2 = 154
y2 = 369

cx, cy, width, height = corners_to_grid_cells(x1, y1, x2, y2)
best_anchors = get_best_anchors(width, height, use_anchors)

print(f'detector location for second object: ({cx},{cy},{best_anchors[0][0]})')
print(f'second object shape: [{width * 32.} {height * 32.}]')

print(f'best anchor shape: {anchors[best_anchors[0][0]]}')
print(f'best anchor IOU: {best_anchors[0][1]}')

print(f'second best anchor shape: {anchors[best_anchors[1][0]]}')
print(f'second best anchor IOU: {best_anchors[1][1]}')
```

```
detector location for first object: (4,11,5)
first object shape: [12.0 13.0]
best anchor shape: [22 20]
best anchor IOU: 0.35454545454545455
second best anchor shape: [54 43]
second best anchor IOU: 0.06718346253229975

detector location for second object: (4,11,5)
second object shape: [13.0 11.0]
best anchor shape: [22 20]
best anchor IOU: 0.325
second best anchor shape: [54 43]
second best anchor IOU: 0.0615848406546081
```
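For reference, here is one way the two helpers used above might be implemented. The function names, the 32-pixel grid, and the printed IOU values come from the post; everything else, including the anchor list, is my own reconstruction. The IOU here compares shapes only, with both boxes anchored at a common corner.

```python
# Hypothetical reconstructions of the two helpers used above. The anchor list
# is invented, except that index 5 holds the [22, 20] shape and the [54, 43]
# shape also appears, so the output matches the printed results.
GRID = 32.0

def corners_to_grid_cells(x1, y1, x2, y2):
    """Corner coords in pixels -> (cell_x, cell_y, width, height) in grid units."""
    cx = int((x1 + x2) / 2 / GRID)   # grid cell containing the box center
    cy = int((y1 + y2) / 2 / GRID)
    width = (x2 - x1) / GRID
    height = (y2 - y1) / GRID
    return cx, cy, width, height

def get_best_anchors(width, height, anchors):
    """Rank anchors by shape-only IOU, i.e. both boxes share a corner."""
    w, h = width * GRID, height * GRID   # back to pixels to compare with anchors
    ious = []
    for i, (aw, ah) in enumerate(anchors):
        inter = min(w, aw) * min(h, ah)
        union = w * h + aw * ah - inter
        ious.append((i, inter / union))
    return sorted(ious, key=lambda t: t[1], reverse=True)

# Made-up anchor set; only [22, 20] (index 5) and [54, 43] are from the post.
use_anchors = [[110, 90], [65, 130], [180, 150], [90, 45],
               [250, 200], [22, 20], [54, 43]]
```

With these definitions, both boxes above land in cell (4, 11) with best anchor index 5, and the IOUs match the printed 0.3545… and 0.325 values, which is the collision the analysis describes.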
2. Resizing/cropping images to fit YOLO v2’s default square shape and x32 downsampling causes issues with bounding boxes

I cropped out a 608x608 section of the much larger Berkeley images. At first I ignored any labels (bounding boxes) that straddled the image edges and used only the ones completely within the cropped area. But I noticed that many images have cars (the only class I am looking at for now) right on the edge of the image, and I worried that omitting these from my ‘cropped’ ground truth might be causing trouble for the learner. When I went back to include the portions of these ‘edge’ objects that survive the crop, however, some of the resulting bounding boxes no longer follow the shape of the car.

The images above show the process in two stages. The first image is uncropped. Originally, the bounding box for the white SUV extended to include the back door and the rear of the vehicle. The red box shows where the label for the white SUV will end up once the image is cropped.

Here it is after the crop. It isn’t obvious why the bounding box extends all the way to the top and bottom of the image. I didn’t want to try to figure out the angle the object was at relative to the camera and adjust for things like the back door of the car getting cropped out. So I’m telling my model that sometimes ‘cars’ include pixels that look a whole lot like buildings and trees. The car on the left side is an example of a cropped/partial bounding box that didn’t cause an issue because of its orientation to the camera.
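The clipping step described above can be sketched roughly like this. The function, its parameters, and the minimum-side guard are my own, not from the BDD tooling:

```python
# Sketch of clipping a label to a crop window. Boxes straddling the crop edge
# are clipped to the window; boxes outside it, or clipped to slivers, are dropped.
def clip_box_to_crop(box, crop_x, crop_y, crop_size=608, min_side=2):
    """box = (x1, y1, x2, y2) in original-image pixels; returns a crop-relative
    box, or None if nothing usable is left inside the crop window."""
    x1 = max(box[0], crop_x) - crop_x
    y1 = max(box[1], crop_y) - crop_y
    x2 = min(box[2], crop_x + crop_size) - crop_x
    y2 = min(box[3], crop_y + crop_size) - crop_y
    if x2 - x1 < min_side or y2 - y1 < min_side:
        return None
    return (x1, y1, x2, y2)
```

Note that the clipped rectangle is kept whole rather than reshaped to the visible part of the object, which is exactly how a ‘car’ label can end up spanning pixels that look like buildings and trees.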

3. Boundary conditions in the YOLO prediction formulas have to be guarded against

As detailed in other related posts, the output of the YOLO CNN is passed through transformations involving log and exponential functions, and these don’t always play nicely with zeros and negative numbers. For example, when creating the ground truth you can end up with an object center exactly on a grid-cell corner, i.e. an offset of zero: \sigma(t_x) = 0, which has no finite solution for t_x. This makes the library functions grumpy. I also ended up with some bounding boxes of less than one pixel during the cropping phase, which rounded to zero width or height. In these cases I added some small deltas, or simply dropped the near-zero-area bounding boxes, to prevent math problems downstream.
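A minimal sketch of the epsilon guard I mean, assuming a plain inverse sigmoid (logit) is used to recover t from a target offset (the function name and epsilon value are my own):

```python
import math

# Inverting b = sigmoid(t) gives t = log(b / (1 - b)), which blows up at
# b = 0 or b = 1, so clamp the offset into (eps, 1 - eps) first.
EPS = 1e-6

def safe_logit(offset, eps=EPS):
    """Inverse sigmoid with the input clamped away from 0 and 1."""
    b = min(max(offset, eps), 1.0 - eps)
    return math.log(b / (1.0 - b))
```

With the clamp, an object center sitting exactly on a grid-cell corner produces a large negative t rather than a -inf or a domain error.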

Hope this helps


I recognize that I’m kinda remotely knocking down trees in the forest with these posts, but here is another realization that took me a while to figure out and might save someone a little headache.

4. The data sets are SPARSE! You need to overcome this with lots of images and/or augmentation, which means you need a strategy for dealing with data sets much larger than can fit in memory at one time. Here’s what I mean. I decided to start out with small data sets so that a) I could run them on my laptop and not have to pay for CPU/GPU time, and b) I might be able to do some hand calculations to check what was going on. For my 19x19 grid with 8 anchor boxes per cell and 16 training images, there are 46,208 prediction locations. But in these 16 images there are only 92 labelled objects. Assuming they are all in unique locations, that works out to (divide …carry the one… ) about 0.2% of locations populated. Way over 99% of the ‘detectors’, as Joseph Redmon calls them, never see any labelled object during training, so it’s no surprise that my model doesn’t generalize well. Still debating approaches to remedy this.
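The arithmetic above, worked out explicitly:

```python
# Back-of-envelope check of the sparsity figures quoted above.
grid = 19 * 19            # grid cells per image
anchor_boxes = 8          # anchors per cell
images = 16
detectors = grid * anchor_boxes * images
objects = 92

print(detectors)                       # 46208 prediction locations
print(f'{objects / detectors:.3%}')    # 0.199% of detectors see an object
```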

Below is what the 19x19 grid cell map looks like. The integers represent the number of labels from my 16 training images that fell in those grids. Lots o’ zeros!

```
[[0 0 0 0 2 0 1 1 2 1 1 1 0 0 0 0 0 0 0]
 [0 0 0 0 2 0 1 3 0 1 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 3 1 1 0 0 1 0 0 0 0 0 0 0]
 [0 0 0 0 2 0 2 2 2 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 2 2 2 2 1 1 0 2 0 0 0 0 0 0 0]
 [0 0 0 0 1 1 1 0 0 1 1 3 0 0 0 0 0 0 0]
 [0 0 0 2 2 1 2 4 0 1 0 1 0 0 0 0 0 0 0]
 [0 0 0 0 2 0 1 0 0 0 0 1 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 2 2 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 2 1 1 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 2 1 1 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]]
```
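As a sanity check, the same map can be summarized with a couple of lines of NumPy: the counts sum to the 92 labels mentioned above, and only 62 of the 361 cells (about 17%) are populated at all.

```python
import numpy as np

# The 19x19 label-count map printed above.
counts = np.array([
    [0,0,0,0,2,0,1,1,2,1,1,1,0,0,0,0,0,0,0],
    [0,0,0,0,2,0,1,3,0,1,0,0,0,0,0,0,0,0,0],
    [0,0,0,0,1,0,3,1,1,0,0,1,0,0,0,0,0,0,0],
    [0,0,0,0,2,0,2,2,2,0,0,0,0,0,0,0,0,0,0],
    [0,0,0,0,2,2,2,2,1,1,0,2,0,0,0,0,0,0,0],
    [0,0,0,0,1,1,1,0,0,1,1,3,0,0,0,0,0,0,0],
    [0,0,0,2,2,1,2,4,0,1,0,1,0,0,0,0,0,0,0],
    [0,0,0,0,2,0,1,0,0,0,0,1,0,0,0,0,0,0,0],
    [0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0],
    [0,0,0,0,0,0,1,1,1,0,1,0,0,0,0,0,0,0,0],
    [0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0],
    [0,0,0,0,0,0,0,2,2,0,0,0,0,0,0,0,0,0,0],
    [0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0],
    [0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0],
    [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
    [0,0,0,0,0,0,0,2,1,1,0,0,0,0,0,0,0,0,0],
    [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
    [0,0,0,0,0,0,0,0,2,1,1,0,0,0,0,0,0,0,0],
    [0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0],
])

print(counts.sum())                # 92 labels total
print(np.count_nonzero(counts))    # 62 populated grid cells
print(f'{np.count_nonzero(counts) / counts.size:.1%}')   # 17.2% of 361 cells
```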


This sparseness sounds like a really salient observation. Mind you, I don’t have any personal domain knowledge w.r.t. YOLO: all I know is what Prof Ng said in the lectures and the notebooks and what I’ve learned by reading your posts on the subject. I have not read any of the original papers. But it can’t be the case that this observation didn’t apply to Redmon et al. when they cooked up this stuff in the first place. Do they keep their training methods secret to hold some proprietary advantage? Or do they simply have essentially infinite resources and the full compute power of Google’s backend infrastructure, so they can train on so much more data that they achieve generality by brute force?

From the original paper:

> We pretrain our convolutional layers on the ImageNet 1000-class competition dataset [30]. For pretraining we use the first 20 convolutional layers from Figure 3 followed by a average-pooling layer and a fully connected layer. We train this network for approximately a week and achieve a single crop top-5 accuracy of 88% on the ImageNet 2012 validation set, comparable to the GoogLeNet models in Caffe’s Model Zoo [24]. We use the Darknet framework for all training and inference [26].
>
> We then convert the model to perform detection. Ren et al. show that adding both convolutional and connected layers to pretrained networks can improve performance [29]. Following their example, we add four convolutional layers and two fully connected layers with randomly initialized weights. Detection often requires fine-grained visual information so we increase the input resolution of the network from 224 × 224 to 448 × 448.

I should go and look at the data set he mentions; I’m not sure how big it is. But a couple of things I do know: 1) I am training the entire pipeline end to end, not decomposing it into pretraining and fine tuning, and 2) I am not training ‘for approximately a week’ !!!

Here is the relevant section from the YOLO9000 paper:

> Training for classification.
> We train the network on the standard ImageNet 1000 class classification dataset for 160 epochs using stochastic gradient descent with a starting learning rate of 0.1, polynomial rate decay with a power of 4, weight decay of 0.0005 and momentum of 0.9 using the Darknet neural network framework [13]. During training we use standard data augmentation tricks including random crops, rotations, and hue, saturation, and exposure shifts.
>
> As discussed above, after our initial training on images at 224 × 224 we fine tune our network at a larger size, 448. For this fine tuning we train with the above parameters but for only 10 epochs and starting at a learning rate of 10^-3. At this higher resolution our network achieves a top-1 accuracy of 76.5% and a top-5 accuracy of 93.3%.
>
> Training for detection.
> We modify this network for detection by removing the last convolutional layer and instead adding on three 3 × 3 convolutional layers with 1024 filters each followed by a final 1 × 1 convolutional layer with the number of outputs we need for detection. For VOC we predict 5 boxes with 5 coordinates each and 20 classes per box so 125 filters. We also add a passthrough layer from the final 3 × 3 × 512 layer to the second to last convolutional layer so that our model can use fine grain features.
>
> We train the network for 160 epochs with a starting learning rate of 10^-3, dividing it by 10 at 60 and 90 epochs. We use a weight decay of 0.0005 and momentum of 0.9. We use a similar data augmentation to YOLO and SSD with random crops, color shifting, etc.

Again, I am using a much smaller dataset, am not separating detection and classification training, and not augmenting the BDD images. I am beginning to understand why they went to all that effort though (and why almost every article you can find online includes the phrase ‘we started with a pretrained model…’). I initially thought these gymnastics were there to squeeze out the last few possible points of accuracy and precision. Now I think they are required to get it to do anything useful at all.

Found on the web…

> The ImageNet dataset consists of three parts: training data, validation data, and image labels. The training data contains 1000 categories and 1.2 million images, packaged for easy downloading. The validation and test data are not contained in the ImageNet training data (duplicates have been removed).

1.2 million images ROFL

I have been working with 16

Started working on an improved data set for YOLO training. I dropped the image size from 608x608 to 416x416, and instead of doing a single crop from each BDD training image, I wrote a crawler that does 54 separate crops. This moves the crop around the original image and repositions the labelled objects, resulting in at least an order-of-magnitude increase in unique grid cell locations populated with data (even though I also reduced the grid to 13x13, I still have 10x more cells populated).
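A crawler like that might look something like the following. The 9x6 lattice of crop origins is my guess, not the post’s actual stride; it happens to produce exactly 54 in-bounds 416x416 crops of a 1280x720 BDD frame.

```python
# Sketch of a crop "crawler": evenly spaced top-left corners for nx*ny crops
# that span the source image (the 9x6 lattice is an assumption, chosen so a
# 1280x720 frame yields the 54 crops mentioned in the post).
def crop_origins(img_w=1280, img_h=720, crop=416, nx=9, ny=6):
    """Return (x, y) top-left corners of nx*ny crops evenly spanning the image."""
    xs = [round(i * (img_w - crop) / (nx - 1)) for i in range(nx)]
    ys = [round(j * (img_h - crop) / (ny - 1)) for j in range(ny)]
    return [(x, y) for y in ys for x in xs]

origins = crop_origins()
print(len(origins))   # 54 crops per source image
```

Each crop window then gets its labels clipped and re-expressed in crop-relative coordinates, so the same object lands in different grid cells in different crops.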
