Output layer for detecting the same object (a bounding box) multiple times in an image


I can’t find information on how to get multiple bounding boxes as output for an image when the object I’m looking for is present in the image at least once.

I’ve seen existing projects that can do this, such as the CRAFT text detector, which draws bounding boxes around all detected text.

Any help is greatly appreciated! :blush:

Joseph Redmon at the podium :man_bowing:

YOLO can output multiple bounding boxes per forward pass per image. However, it is designed to generate a single bounding box per object. Generally, multiple bounding boxes for the same object are undesirable. Perhaps you can elaborate on the functional/business requirement to help us make better recommendations.
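For reference, detectors in this family usually get from many overlapping candidate boxes down to one box per object with a post-processing step called non-maximum suppression (NMS). Below is a minimal pure-Python sketch of greedy NMS, not YOLO’s actual implementation. The `(x1, y1, x2, y2)` box format, the function names, and the 0.5 IoU threshold are all assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: return indices of the boxes to keep, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two overlapping candidates for one object, plus a separate second object.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

The two overlapping candidates collapse into the higher-scoring one, while the distant box survives as its own detection, so several objects of the same class each keep their own box.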

Thanks for the response.

I checked out YOLO, but it is not what I need. I want to train the network to detect all occurrences of one specific object in the provided image. What I want to accomplish is to extract regions of interest from a scanned, specially marked document.

Detecting all the faces in an image is a similar use case, and one that most mobile phones can handle in real time.

Common terminology is that objects are unique instances, and thus each occurs only once within an image. If there are two faces in an image, those are two objects, although of one class. That holds unless your objective is to go further and recognize that the two faces actually belong to the same person. In that case I think you’re looking at a process with at least two phases: one to detect (localize plus classify), then a second to interpret (facial comparison, reading characters from a license plate, etc.).

I think I did not word it right. I want to detect all objects belonging to the same class. I understand that all the objects are unique, but I want them all detected.

YOLO can detect (localize plus classify) multiple objects of the same or different classes in an image. You can always filter out objects of uninteresting classes afterwards. If you watched the video, Mr Redmon also talks briefly about other approaches that were state of the art circa 2015, when YOLO was invented (e.g. Deformable Parts Models, R-CNN, Fast R-CNN), and how they compare in speed and accuracy. The video also shows at least one image with several airplanes, each with its own bounding box and class label, which seems like what you are interested in doing.

To the best of my knowledge that was the first public presentation of YOLO, so it’s an historic event and worth the 13 minutes to watch even if you end up using something else in your project.
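The “filter out uninteresting classes afterwards” step mentioned above can be sketched in a few lines. This is only an illustration: the `Detection` record, its field names, and the 0.5 score cutoff are hypothetical and do not come from any particular YOLO implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    # Hypothetical record for one detected object.
    label: str                               # predicted class name
    score: float                             # confidence in [0, 1]
    box: Tuple[float, float, float, float]   # (x1, y1, x2, y2)


def keep_class(detections: List[Detection], wanted: str,
               min_score: float = 0.5) -> List[Detection]:
    """Return every detection of the wanted class above the score cutoff."""
    return [d for d in detections if d.label == wanted and d.score >= min_score]


detections = [
    Detection("airplane", 0.92, (10, 20, 110, 80)),
    Detection("airplane", 0.81, (150, 30, 260, 95)),
    Detection("person", 0.77, (300, 40, 330, 120)),
    Detection("airplane", 0.30, (400, 10, 450, 40)),  # below the cutoff
]

airplanes = keep_class(detections, "airplane")
for d in airplanes:
    print(d.label, d.score, d.box)
```

Every instance of the class comes back as a separate entry, which is the “all objects of one class” behaviour asked about earlier in the thread.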

Random YOLO screen capture found on the web…

It wasn’t shown in the example above, but a YOLO-based system could be used in near real time to detect license plates, read them, and determine whether each plate was carried by a car or a truck, in order to charge the vehicle owner the appropriate toll/tariff.
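One piece of that pipeline, deciding which vehicle a detected plate belongs to, can be approximated with simple box geometry. The sketch below assumes the detector has already produced labeled vehicle boxes and plate boxes in `(x1, y1, x2, y2)` format, and it assigns a plate to the smallest vehicle box that contains the plate’s center. The function names and the matching rule are assumptions for illustration, not part of YOLO.

```python
from typing import List, Optional, Tuple

# Assumed box format: (x1, y1, x2, y2).
Box = Tuple[float, float, float, float]


def center(box: Box) -> Tuple[float, float]:
    """Center point of a box."""
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)


def contains(outer: Box, point: Tuple[float, float]) -> bool:
    """True if the point lies inside (or on the edge of) the box."""
    x, y = point
    return outer[0] <= x <= outer[2] and outer[1] <= y <= outer[3]


def area(box: Box) -> float:
    return (box[2] - box[0]) * (box[3] - box[1])


def vehicle_for_plate(plate: Box, vehicles: List[Tuple[str, Box]]) -> Optional[str]:
    """Label of the smallest vehicle box containing the plate center, if any."""
    candidates = [(area(b), label) for label, b in vehicles
                  if contains(b, center(plate))]
    return min(candidates)[1] if candidates else None


vehicles = [("truck", (0, 0, 100, 60)), ("car", (120, 10, 200, 50))]
plate_on_car = (150, 40, 170, 48)
stray_plate = (300, 300, 310, 305)

print(vehicle_for_plate(plate_on_car, vehicles))
print(vehicle_for_plate(stray_plate, vehicles))
```

Reading the characters on the plate would be a separate second stage, run on the cropped plate region.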

Something similar is described here: