Here’s how I think of it. Suppose we want to do classification when there is one object in an image. You can run a CNN forward pass and easily generate a prediction, right? Cat. But how do you deal with images containing two objects when the network only produces a single output?

The initial approach was to divide the input image into regions and run the same classification network on each one. If there is one object in each region, we’re good. Except now you’re doing a lot more computation, and some of those regions may still contain multiple objects.

YOLO was a reaction to this challenge: how do you handle multiple objects, possibly near each other, and still run in real time? By dividing the image into an S×S grid of cells and predicting B bounding boxes per cell (S and B are the symbols used in the YOLO papers), a YOLO CNN outputs S·S·B box predictions, plus class probabilities, from a single forward pass. (Anchor boxes, which give each of the B predictors a prior shape, were actually introduced later in YOLOv2.) You kind of get the best of both worlds: good accuracy even on multi-object images at a very high frame rate.

When it was introduced circa 2016, YOLO was competitive in accuracy with state-of-the-art region-based approaches but was substantially faster, which is why it is still studied 6 years later. Hope this helps.
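To make that output shape concrete, here’s a quick back-of-the-envelope sketch using the numbers from the original YOLO paper (S=7 grid, B=2 boxes per cell, C=20 classes for PASCAL VOC):

```python
# Output layout of YOLOv1, using the values from the 2016 paper:
# S = 7 (grid is S x S), B = 2 boxes per cell, C = 20 classes.
S, B, C = 7, 2, 20

# Each box prediction is (x, y, w, h, confidence) = 5 numbers,
# and each cell also predicts C class probabilities.
per_cell = B * 5 + C            # numbers per grid cell
output_size = S * S * per_cell  # total outputs from one forward pass
total_boxes = S * S * B         # candidate boxes per image

print(per_cell, output_size, total_boxes)  # 30 1470 98
```

So a single forward pass yields 98 candidate boxes in one 7×7×30 tensor, which is why there’s no per-region rerunning of the network.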