One area of recurring questions about YOLO is anchor boxes, introduced with the second 2016 paper, YOLO9000 aka YOLO v2, as an optimization to help improve model stability during training. I’ll defer discussion of the role they play for now and just cover how they are created. Once you see what they are, I think it will be easier to understand how they are used.
I worked with the Berkeley DeepDrive 100K dataset (BDD). To keep the size manageable, I used only the val subset, which contains 10000 images. In addition to the images, the val data contains labels in a JSON file for a variety of road-related object types that it calls categories: car, traffic sign, traffic light, rider, etc. I extracted the 102506 bounding boxes just related to cars. Each label of category car has several attributes, one of which is named box2d . From each box2d I extracted the label and value for x1, y1, x2, and y2. For this exercise, I don’t care about the position of the bounding boxes, only the sizes, so I calculated the width and height of each (that is, x_dim = x2 – x1, y_dim = y2 – y1).
Figure 1. Excerpt from bdd100k_labels_images_val.json
It turns out that of the 102506 bounding boxes, there are 88614 unique shapes. I counted and sorted the non-unique shapes to see how frequently they occurred. Here are the top 8:
Figure 2. Average IOU of Top 8 non-unique bounding box shapes with occurrence count
The total number of bounding boxes in the 8 most common shapes is 1,577. The average IOU of those 8 boxes versus the entire bounding box set is about 28%. We’ll come back to those numbers below. Notice, however, that these are small relative to the BDD image size of 1276 x 718 and all roughly the same shape: more or less square.
Per discussion in the published paper, YOLO v2 anchor boxes were derived from unsupervised learning on the ground truth labelled bounding box data. Specifically, they were determined by using K-means clustering. K-means attempts to partition a data set such that members of a cluster are closer to that cluster’s center than they are to any other cluster’s center. There are different distance, or similarity, measures one can choose. One that works well for this application is to minimize average IOU error between the cluster centroid and the elements of the clusters. Basically, we want to find a small set of boxes that are generally representative of the full data set. IOU allows us to compare the shapes of the small set to the shapes of the large set and measure just how similar they are on average.
I implemented K-means and average IOU in Python using pseudo-code and code found on the web, then created a simple loop to call them using my ground truth data and different number of cluster centroids. The idea is to pick the number of centroids that provides a reasonably low average IOU error. In the limit you could achieve 0 error by having one data element per cluster, but that would prove untenable, as it would blow up the size of the CNN output (remember the output dimensions is S x S x B x (1 + 4 + C). For this dataset, B = 102506, or even B = 88614 would be prohibitively expensive. I ran K-Means with a minimum of 3 clusters and a maximum of 10. My homegrown K-means implementation took less than 10 lines of code, or you can use a robust library function like sklearn.Cluster.KMeans and reduce that to 1 or 2. The exact results are sensitive to the type of initialization, number of iterations, and stopping conditions, and though I haven’t measured it all the way through YOLO training and testing, my belief is that the impact of these variations is small.
Plotting the average IOU versus the number of cluster centroids shows diminishing improvement (concave down). Notice that this plot corresponds exactly with the one shown in Figure 2: Clustering box dimensions on VOC and COCO in the YOLO 9000 paper. The YOLO team chose 5 centroids. I did the analysis below using 8. There is no ‘right’ answer for how many anchor boxes; it depends on your dataset, your runtime environment (cloud, laptop, mobile device, etc), and your project requirements (runtime throughput, accuracy, etc).
Here is my plot of the Average IOU versus number of Centroids,
Figure 3. Average IOU versus number of clusters
which looks very much like the one in the YOLO9000 paper:
Figure 4. Reproduction of Figure 2. from the YOLO 9000 paper
The table below contains avg IOU accuracy for the different number of centroids I tried:
Figure 5. Average IOU of computed anchor box shapes by Number of Centroids used
Computed shapes of cluster centers for 8 centroids:
Figure 6. Computed Anchor Box Shapes and Average IOU using 8 centroids
Note: The units in this table are pixels. Depending on how you’re going to use them, you might choose to store the values just like this, or normalized into units either of image size or grid cell size. That is, an anchor box width for a 416x416 image with 13x13 grid could be 32 pixels, or 1.0 grid cell relative, or 0.077 image relative. You’re free to convert back and forth depending on what you’re interested in computing. YOLO frequently uses grid cell relative so the units can directly be combined within the loss function, then scaling back to image relative at the end.
Several things to notice in these numbers compared to the table for the top 8 sorted bounding boxes. First, the sizes and shapes are varied; they are much less square and there is considerably more range in overall size. Most importantly, the average IOU is 2x to almost 3x higher than it was using the boxes selected by sorting on occurrence. This shows that the anchor boxes proposed by the K-Means approach are much more representative of the entire data set. This makes intuitive sense, given that the top 8 from sorting only covered about 1.5% of the bounding boxes in the data set and were of a very homogeneous shape. Sorting and counting might be a simple solution, but it’s not a very good one. The YOLO team (JP Redmon et al ) preferred K-means. Here it is in their own words using k-means to generate our bounding box starts the model off with a better representation and makes the task easier to learn.
The next step is to use the computed anchor boxes to set up the ground truth data. In order for the cost function to work naturally, the CNN output and the ground truth must have the same shape. The trick is to map the labelled ground truth bounding boxes to the best anchor box (spoiler alert: YOLO uses IOU) and grid cell location. That will be covered in the next thread, which will explain why using representative anchor boxes (the paper calls them priors ) improves network training over using only a naive approach or purely random initialization (equivalent to no anchor boxes at all).