Detecting Multiple Objects using YOLO - Grid Cells plus Anchor Boxes


One key feature of YOLO is its ability to detect (locate plus classify) multiple objects per image. Previous threads have introduced the concept of anchor boxes and how they are applied during training data setup. This thread will cover how grid cells and anchor boxes together enable predictions on multiple objects within one image.

The motivation

Think back to your first efforts to train an image classifier on images containing a single object. You initialized the weights, often randomly, ran forward propagation, then evaluated the difference between the output of the network and the ground truth data. The object in the image, and thus the image itself, was determined to be either a cat or not-cat. It turns out to be straightforward to extend that to handle multiple output classes, and even to add prediction of a bounding box for that single object. But handling multiple objects presents new challenges. Circa 2016, when the first YOLO paper was published, the most accurate results came from breaking up the input image into regions and running the neural network on each region. Sometimes this meant two separate networks, one for producing region candidates and one for detecting within those regions. The results were either accurate or fast, but never both.

The innovation was for YOLO to stick with a single input region, and instead to subdivide the structure of the output of the neural network. That is, rather than producing a single classification prediction multiple times, the YOLO network produces multiple classification predictions all at once. The trick is to see the output of the network as a 3D array of what the paper called detectors, each of which can make its own set of predictions.

The implementation

The following figure depicts a notional 3D network output shape using the symbols used in the YOLO papers:

Figure 1. Notional 3D CNN output shape for S = 3 and B = 5

Here S represents the number of grid cells along each side of the image, and B represents the number of anchor boxes per grid cell. The idea is that by shaping the last layer of the CNN this way, it can output S*S*B values. In this example that is 3 * 3 * 5 = 45. A CNN with this shape could make 45 cat/not-cat predictions on the same input image from the same, single forward propagation. By adding a 4th dimension you can produce even more output values. Say C represents 4 bounding box location predictions. (The YOLO papers use C for the number of classes; here it simply stands for the number of values each detector outputs.) Now you have S*S*B*C values, or 3 * 3 * 5 * 4 = 180 predicted values. Note that you could think of the previous depiction as C = 1.
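To make the counting concrete, here is a minimal sketch in plain Python. The helper name `num_output_values` and its `values_per_detector` parameter are made up for illustration; they simply stand in for the S, B, and C of the text above.

```python
from math import prod

def num_output_values(S, B, values_per_detector=1):
    """Count the values in a notional S x S x B x values_per_detector output.

    Hypothetical helper for illustration only.
    """
    shape = (S, S, B, values_per_detector)
    return prod(shape)

# 3D case: one cat/not-cat value per detector
cat_predictions = num_output_values(S=3, B=5)

# 4D case: four bounding box values per detector
box_values = num_output_values(S=3, B=5, values_per_detector=4)

print(cat_predictions, box_values)
```

The same function covers both figures: the 3D shape is just the 4D shape with one value per detector.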

Figure 2. Notional 4D CNN output shape for S = 3, B = 5, and C = 4

In the notebook used for the car detection exercise, S = 19, B = 5, and the 4th dimension contains not only the 4 bounding box predictions (b_x, b_y, b_w, b_h), but also the object presence prediction (p_c), and the 80 class probability predictions (c_i). In other words 19 * 19 * 5 * (1 + 4 + 80) = (19*19) * (425) = 153,425 predicted values, output simultaneously from the same single forward propagation.
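As a rough illustration of how those 425 values per grid cell break down, the sketch below splits one cell's flat vector into 5 detectors of 85 values each. The ordering within each detector (p_c first, then the 4 box values, then the 80 class probabilities) is an assumption made for readability; the actual layout depends on the model implementation. The function name `split_cell` is also made up.

```python
B = 5                                       # anchor boxes per grid cell
NUM_CLASSES = 80
VALUES_PER_DETECTOR = 1 + 4 + NUM_CLASSES   # p_c + box + class probabilities = 85

def split_cell(cell_values):
    """Split one grid cell's flat vector into B per-detector dicts.

    Assumed layout per detector: [p_c, b_x, b_y, b_w, b_h, c_1 ... c_80].
    """
    assert len(cell_values) == B * VALUES_PER_DETECTOR
    detectors = []
    for b in range(B):
        start = b * VALUES_PER_DETECTOR
        chunk = cell_values[start:start + VALUES_PER_DETECTOR]
        detectors.append({
            "p_c": chunk[0],
            "box": tuple(chunk[1:5]),
            "class_probs": chunk[5:],
        })
    return detectors

# Fake output for one grid cell in which only detector 2 "fires"
cell = [0.0] * (B * VALUES_PER_DETECTOR)
offset = 2 * VALUES_PER_DETECTOR
cell[offset] = 0.9                                   # object presence
cell[offset + 1:offset + 5] = [0.5, 0.5, 0.3, 0.2]   # b_x, b_y, b_w, b_h
cell[offset + 5 + 2] = 0.8                           # probability of class index 2

detectors = split_cell(cell)
total_values = 19 * 19 * B * VALUES_PER_DETECTOR     # whole output tensor
```

In a real model this slicing would be done on the whole tensor at once, for example by reshaping a 19 x 19 x 425 output to 19 x 19 x 5 x 85.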

Anchor Box-specialized Detectors

Hopefully the role that anchor boxes play in enabling YOLO to handle multiple objects in an image is clear from the above. But it is important to link this to the discussion in the previous thread about best anchors. Remember that during training, each labelled object is assigned to one specific grid cell + anchor box location, or detector, based on the IOU between that anchor box shape and the object ground truth. Training reinforces the association of anchor box shape with object ground truth, in effect specializing detectors in the 3D array to predict certain object shapes more accurately than others. The ‘short-and-wide’ anchor box location is more likely to fire when there is a car than when there is a person, which will be detected by the ‘tall-and-narrow’ anchor box location. This is what enables YOLO not only to detect multiple objects within an image, but to detect multiple objects of different shapes in the same location of the image, such as a person standing in front of a car, or a dog in front of a bicycle.
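The assignment of a labelled object to a detector can be sketched as follows. This is a simplified illustration, not the notebook's code: the anchor shapes are invented, box and anchor sizes are both assumed to be expressed as fractions of the image, and the function names are hypothetical. The IOU here compares shapes only, as if the anchor and the ground truth box shared the same center.

```python
def shape_iou(wh_a, wh_b):
    """IOU of two boxes compared by shape only (treated as sharing a center)."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union

def assign_detector(box, anchors, S):
    """Return (row, col, anchor_index) for a ground truth box.

    box is (x_center, y_center, width, height), normalized to [0, 1].
    """
    x, y, w, h = box
    col = min(int(x * S), S - 1)   # grid cell containing the object center
    row = min(int(y * S), S - 1)
    ious = [shape_iou((w, h), anchor) for anchor in anchors]
    best = max(range(len(anchors)), key=lambda i: ious[i])
    return row, col, best

# Hypothetical anchor shapes (width, height): short-and-wide, tall-and-narrow
anchors = [(0.6, 0.25), (0.2, 0.6)]

car = (0.5, 0.5, 0.55, 0.3)      # centered in the image, wide
person = (0.5, 0.5, 0.15, 0.5)   # same center, tall

car_slot = assign_detector(car, anchors, S=3)
person_slot = assign_detector(person, anchors, S=3)
print(car_slot, person_slot)
```

Both objects land in the same grid cell but in different anchor slots, which is how the person-in-front-of-a-car case can be represented in the training labels.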

Grid cells, Anchor boxes, and Non-max-suppression

It turns out that while grid cells and anchor boxes elegantly address the requirement to detect multiple objects within an image, they create a new problem: duplicate predictions of the same object by multiple detectors. This situation can arise, for example, when an object center is near the boundary of two grid cells. Then multiple detectors in the two neighboring grid cells might each predict the object. How can you distinguish whether two predictions are of the same object, in which case the duplicate should be suppressed, versus the case where two predictions are really two separate objects? YOLO uses non-max-suppression, which under the covers uses IOU. Here’s how it works.

After all S*S*B detectors have made their predictions (45 in the toy example above, or 19 * 19 * 5 = 1,805 in the car detection notebook), the ones with confidence below a confidence threshold are suppressed immediately. The remaining ones are examined for uniqueness. Starting with the highest confidence prediction, compare it using IOU to all the others that passed the threshold. If the IOU is 1, the location and shape are exactly the same, so the two are almost certainly duplicates. If the IOU is 0, the boxes are disjoint, and both should be kept. Anything in between is subject to a tuneable IOU threshold value: predictions that overlap the selected one by more than the threshold are suppressed, and the process repeats with the next highest confidence prediction that remains. You want to keep objects that are in the same location but with different shapes (the person in front of a car scenario would have a low IOU) but reject objects that are so similar that they are likely duplicates (two predictions of the same person, or of the same car). Notice that you shouldn’t just pick the single highest confidence prediction; you might end up keeping the person and throwing away the car, or vice versa, when the correct result is keeping both.
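The procedure just described can be sketched in a few lines of plain Python. This is a minimal greedy version for illustration, not the notebook's implementation: the box coordinates, scores, and threshold values are invented, boxes are assumed to be given as (x1, y1, x2, y2) corners, and class labels are ignored.

```python
def iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2) corners."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, score_threshold=0.6, iou_threshold=0.5):
    """Return indices of the boxes to keep, highest score first."""
    # Step 1: drop low confidence predictions
    candidates = [i for i, s in enumerate(scores) if s >= score_threshold]
    candidates.sort(key=lambda i: scores[i], reverse=True)

    kept = []
    while candidates:
        # Step 2: keep the most confident remaining prediction
        best = candidates.pop(0)
        kept.append(best)
        # Step 3: suppress anything that overlaps it too much
        candidates = [i for i in candidates
                      if iou(boxes[best], boxes[i]) <= iou_threshold]
    return kept

boxes = [
    (100, 200, 400, 350),   # car
    (110, 205, 405, 355),   # near-duplicate prediction of the same car
    (220, 100, 300, 340),   # person standing in front of the car
    (500, 500, 520, 520),   # low confidence prediction
]
scores = [0.9, 0.8, 0.85, 0.3]

kept = non_max_suppression(boxes, scores)
print(kept)
```

With these made-up numbers the duplicate car overlaps the first car heavily and is suppressed, while the person overlaps it only slightly and survives, so both the car and the person are kept.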


Grid cells and anchor boxes enable the YOLO CNN to detect (classify + locate) multiple objects per input image from a single neural network forward pass. For many types of images, this makes detection both fast and acceptably accurate. Grid cell + anchor box locations, or detectors, are specialized during network training by using the best anchor concept. Using differently shaped anchor boxes helps the network predict multiple objects of different shapes even when they are close or overlapping. A weakness is predicting multiple objects of the same shape when they are close or overlapping.

Final Thought

Anchor boxes and grid cells enable YOLO to quickly and accurately detect multiple objects per input image. Well-chosen anchor box shapes help the network make better shape predictions earlier in training. Non-max-suppression is needed to protect against multiple predictions of the same object; using ranked confidence alone could lead to useful predictions being ignored, negating some of the value of the multiple object predictions generated by the grid cell plus anchor box architecture.


Thanks for your presentation.

Here is what this looks like in terms of the YOLO v2 model itself. I built the CNN using the 608x608 Berkeley Driving Data image used in the previous thread, a 19x19 grid shape, 8 dimension clusters/anchor boxes, and 1 class (cars only for now). That gives an output shape of S*S*B*(1+4+1) = 19*19*8*6.

You can see the 608x608x3 input shape in the input layer, and the 19x19x8x6 shape in the output layer. The Conv2D, BatchNorm, MaxPool, LeakyReLU, etc. layers, as well as the filter number, stride, and padding, are taken right from the YOLO v2 paper, including the skip connection between conv2d_13 and conv2d_20 (not shown in this excerpt).
