Improving recall in a dataset

tl;dr: How do I find objects in my data that my current model doesn’t know about, without having to go through millions of images?

Imagine a scenario where you have a camera running an object detection algorithm in the middle of a jungle far, far away. You have a real-time constraint on running detections, and moving data from your camera to your computer is somewhat expensive because there’s no broadband in the jungle. Also, imagine that annotating images is extremely expensive because objects are small and images are big.

If you want to expand your dataset, you can send back the images where your detector thinks it has found something. The issue with this strategy is that you reinforce whatever bias the detector already has: you will never find the objects you don’t already detect.

A possible solution could be to collect a set of images completely at random throughout one week and keep this set fixed. Since annotating a single image takes a lot of time and a random image is unlikely to contain an object of interest, the annotation process has to be semi-automated. Get predictions from the current detector and rank the images by how likely it is that each one contains at least one object. Prioritize the highest-ranking images for labeling, then retrain the detector after annotating for a few days. After retraining, repeat the process with the new detector to annotate more images, excluding the ones a human has already annotated.
The assumption here is that images with objects of interest that are not currently in our train-set distribution will still be ranked higher than images without such objects. (Our monkey-detector will probably rank an image with our missing and coveted gorilla-class slightly higher than an empty image*). After exhausting the images with the most obvious objects, the less obvious ones will “float to the surface”, and we can get a significant fraction of these without looking through all the images we have (which is prohibitively expensive).
As time progresses, the fraction of annotated images that contain an object of interest will go down, and you can decide on a stopping criterion (e.g. stop when there are more than 100 images between each object of interest).
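To make the loop above concrete, here is a minimal sketch of the ranking and stopping steps. It rests on assumptions that are not part of the original idea: the per-image score is computed as 1 - prod(1 - p_i), which treats the detector’s confidences as independent probabilities, and the names `prob_at_least_one`, `rank_images` and `should_stop` are hypothetical helpers, not part of any library.

```python
import math


def prob_at_least_one(scores):
    """Estimate P(at least one object in the image).

    Assumption: each detection confidence is an independent probability,
    so P(no object) is the product of (1 - score). An image with no
    detections gets a score of 0.
    """
    return 1.0 - math.prod(1.0 - s for s in scores)


def rank_images(predictions, annotated):
    """Return the ids of not-yet-annotated images, most promising first.

    predictions: dict mapping image id -> list of detection confidences.
    annotated:   set of image ids a human has already labeled.
    """
    candidates = {k: v for k, v in predictions.items() if k not in annotated}
    return sorted(
        candidates,
        key=lambda k: prob_at_least_one(candidates[k]),
        reverse=True,
    )


def should_stop(has_object_history, max_gap=100):
    """Stop once more than max_gap images in a row contained no object.

    has_object_history: list of booleans in annotation order, True when
    the annotator found at least one object of interest in that image.
    """
    gap = 0
    for has_object in reversed(has_object_history):
        if has_object:
            break
        gap += 1
    return gap > max_gap


if __name__ == "__main__":
    preds = {"a": [0.9], "b": [0.5, 0.5], "c": [], "d": [0.2]}
    print(rank_images(preds, annotated={"a"}))
```

After each retraining round, `predictions` would be recomputed with the new detector and `annotated` would grow, so the ranking only ever covers images nobody has looked at yet.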

First question:
Are there other or complementary strategies to improve recall when false negatives do not have any noticeable effect? (A miss doesn’t make a car hit a pedestrian, it just makes our animal count incorrect.)

Second question:
When gathering your dataset, it is tempting to use compressed images to reduce transport and storage costs. In theory, you want the training images to be as similar as possible to the images you run inference on. How much do compression artifacts affect object detection networks? I’ve found this hard to quantify properly. On smaller datasets, like CIFAR, this can have a huge effect on final accuracy.
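One way to start quantifying this could be to re-encode an already annotated set at several JPEG quality levels and compare recall across them. The sketch below is only an illustration under stated assumptions: `detect_scores` is a hypothetical callable that returns, for each annotated object in an image, the best matching detection confidence from your own detector; the 0.5 threshold and the quality levels are arbitrary; and Pillow is assumed to be installed for the JPEG round-trip.

```python
import io


def jpeg_roundtrip(image, quality):
    """Encode a PIL image as JPEG at the given quality and decode it again."""
    from PIL import Image  # Pillow, assumed to be installed; imported lazily

    buf = io.BytesIO()
    image.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")


def recall_at_threshold(scores_per_object, threshold=0.5):
    """Fraction of annotated objects whose best detection score passes the threshold."""
    if not scores_per_object:
        return 0.0
    hits = sum(1 for s in scores_per_object if s >= threshold)
    return hits / len(scores_per_object)


def compression_sweep(images, detect_scores, qualities=(95, 75, 50, 25),
                      threshold=0.5, compress=jpeg_roundtrip):
    """Return {quality: recall} over the same annotated images.

    detect_scores(image) is a hypothetical hook into your detector: it
    should return one score per annotated object in that image (0.0 when
    the object was missed entirely).
    """
    results = {}
    for quality in qualities:
        scores = []
        for image in images:
            scores.extend(detect_scores(compress(image, quality)))
        results[quality] = recall_at_threshold(scores, threshold)
    return results
```

Running the sweep with a detector trained on uncompressed images and again with one trained on compressed images would show both the raw sensitivity to artifacts and how much of it is just a train/inference mismatch.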

  * This might not be true, for example in small object detection, where each object of interest covers less than 1% of the pixels in an image. :pleading_face:

Here are some pointers to get you started on relevant research:

A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks

Improving Out-of-Distribution Detection in Machine Learning Models