Intuition for generalizing to detect "big" and "small" objects?

Viewing the C4W3 lectures, I see early on (slide 9, “Sliding windows detection”) that different-sized boxes can capture different-sized (or close / far from camera) cars, for example. This is intuitive: big boxes can frame big cars and small boxes, small ones. The lectures proceed quickly into using ConvNets to do object detection efficiently, but do not explicitly return to how our network can detect cars that are bigger or smaller until “Anchor boxes”, where it is sort of implied that the anchor boxes might differ in size as well as aspect ratio.
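To make the sliding-windows idea concrete, here is a minimal sketch of enumerating windows at more than one size. It is not the lecture's code: the function name `sliding_windows`, its parameters, and the toy 8x8 image are assumptions made up for illustration. In a real detector each window would be cropped and passed to a classifier.

```python
def sliding_windows(image_w, image_h, window_sizes, stride):
    """Yield (x, y, w, h) boxes for every window size at a fixed stride."""
    for w, h in window_sizes:
        # Only keep windows that fit entirely inside the image.
        for y in range(0, image_h - h + 1, stride):
            for x in range(0, image_w - w + 1, stride):
                yield (x, y, w, h)


# Toy example: an 8x8 "image" scanned with small and large windows.
boxes = list(sliding_windows(8, 8, window_sizes=[(2, 2), (4, 4)], stride=2))
small = [b for b in boxes if b[2] == 2]
large = [b for b in boxes if b[2] == 4]
print(len(small), len(large))
```

Every extra window size adds another full pass over the image, which is one reason this approach gets expensive quickly.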

I do not have a strong intuition for whether a network trained and run using the techniques shown would or would not generalize to detect cars (or other objects) of various sizes. Based only on my understanding from lectures, I’m guessing that it would not, unless our training data includes lots of data augmentation in the form of bigger and smaller versions of cars. At the same time, my intuition is that it would be more efficient to build “various sized object detection” into the architecture of a network rather than using oodles more augmented training data with the same car in lots of increments of different sizes. However, I am relatively new to this and am guessing.

Can the techniques in this week’s lecture notes help a network generalize to detecting objects of various sizes (e.g. the very same car, closer to or farther from the camera), or is that left to another part of the system or to data augmentation?

In case relevant, the YOLO algorithm (or rather, data labeling convention), IOU method, and non-max suppression seem clear to me.

Hey @am003e,
Apologies for the delayed response. The techniques presented in the lectures do work on objects of different sizes, and even on the same object captured at different distances, but how well they work depends on your application, dataset, and training.

The good thing is that you can run experiments and find out for yourself whether the approaches presented in the lectures work better than your intuitive approach. You can also look for existing research along these lines, so that you don’t repeat work someone else has already done. Do share your results/findings with the community :nerd_face:

Cheers,
Elemento

Also note that I think the various evolving versions of YOLO are considered the SOTA for object detection these days. Nobody does “sliding windows” anymore for serious object detection work. YOLO is pretty deep waters, of course, but there are a number of great threads on the forum from fellow student ai_curious that explain various aspects of how it works and how to train a YOLO model. Here’s one that discusses the concept of Anchor Boxes, which are different than Bounding Boxes. There are links in that thread to other YOLO threads and you can use the forum search engine to find more.

For those reading along at home, generalizing to diversity in object scale was a known limitation of the earliest versions of YOLO (mentioned in the published papers). V3 introduced even more (!) complexity to deal with predictions at 3 scales simultaneously. If I’m not mistaken, the later versions do even more/better. Learning anchor boxes with different sizes and aspect ratios from the training data helped the algorithm make better predictions on different sized objects, but it was still training and predicting on similar distributions. Training on all small objects but predicting on large objects, or vice versa, is still going to cause problems. Train howya fight is still the best approach.
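For anyone wondering what "learning anchor boxes from the training data" can look like, here is a minimal sketch of one common approach: k-means over the labeled box widths and heights, using IoU as the similarity measure (the YOLOv2 paper describes clustering with a 1 - IoU distance). Everything named here (`wh_iou`, `kmeans_anchors`, the toy boxes, the seed, and the mean-based centroid update) is an assumption for illustration, not the code of any YOLO release.

```python
import random


def wh_iou(box, anchor):
    """IoU of two (width, height) boxes, assuming they share a corner."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iters=50, seed=0):
    """Cluster (width, height) pairs into k anchors; highest IoU = closest."""
    rng = random.Random(seed)
    anchors = rng.sample(boxes, k)  # start from k of the labeled boxes
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            best = max(range(k), key=lambda i: wh_iou(box, anchors[i]))
            clusters[best].append(box)
        new_anchors = []
        for i, members in enumerate(clusters):
            if not members:
                new_anchors.append(anchors[i])  # keep an empty cluster's anchor
                continue
            mean_w = sum(b[0] for b in members) / len(members)
            mean_h = sum(b[1] for b in members) / len(members)
            new_anchors.append((mean_w, mean_h))
        if new_anchors == anchors:  # assignments stopped changing
            break
        anchors = new_anchors
    return sorted(anchors, key=lambda a: a[0] * a[1])  # smallest area first


# Toy labels: three small, roughly square boxes and three large, wide ones.
toy_boxes = [(10, 10), (12, 11), (11, 12), (100, 50), (110, 55), (105, 45)]
anchors = kmeans_anchors(toy_boxes, k=2)
print(anchors)
```

The anchors can only reflect the sizes and shapes present in `toy_boxes`. If the labeled data contains only small objects, the learned anchors will all be small, which is the "similar distributions" caveat above.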

@ai_curious Thank you for this detail! Glad to have my intuition confirmed even months after I finished the course. Frankly, your comments here belong in the lectures, not just in the discussion forums. Being able to detect objects of various sizes is of fundamental importance but it is basically ignored after being hinted at in the early lectures.

“Training on all small objects but predicting on large objects, or vice versa, is still going to cause problems.”

If or when the course is updated, I hope that it will include a plot or two which show this. The reason I signed up for this course was to have a “college course” type experience - with content curation, great explanations, and exercises - rather than a “journal club” type experience where I have to go read papers that are possibly relevant to the subject matter. Especially for something fundamental like variable sized object detection. Thanks again.