Improving recall in a dataset

tl;dr: How do I find objects in my data that my current model doesn’t know about, without having to go through millions of images?

Imagine a scenario where you have a camera running an object detection algorithm in the middle of a jungle far, far away. You have a real-time constraint on running detections and moving data from your camera to your computer is somewhat expensive because there’s no broadband in the jungle. Also, imagine that annotating images is extremely expensive because objects are small and images are big.

If you want to expand your dataset, you can send back images where your detector thinks it’s found something. The issue with this strategy is that you reinforce whatever bias the detector already has: you will never detect the objects you don’t already detect.

A possible solution could be to collect a set of images completely at random over one week and keep this set fixed. Since annotating a single image takes a lot of time and a random image is unlikely to contain an object of interest, the annotation process has to be semi-automated. Get predictions from the current detector and rank images by how likely it is that each contains at least one object. Prioritize the highest-ranking images for labeling, then retrain the detector after annotating for a few days. After retraining, repeat the process with the new detector to annotate more images, excluding the ones a human has already annotated.
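
A minimal sketch of that ranking step, assuming a hypothetical `run_detector(path)` that returns one confidence score per predicted box. The “at least one object” score treats those confidences as independent probabilities, which is a rough but workable approximation:

```python
import math

def at_least_one_object_score(confidences):
    # P(at least one real object) = 1 - P(every detection is a false positive),
    # treating per-box confidences as independent probabilities.
    # An image with no detections scores 0.0.
    return 1.0 - math.prod(1.0 - c for c in confidences)

def rank_for_annotation(image_paths, run_detector, already_annotated):
    """Order unlabeled images so the ones most likely to contain an
    object of interest reach the annotators first."""
    scored = [
        (at_least_one_object_score(run_detector(path)), path)
        for path in image_paths
        if path not in already_annotated
    ]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [path for _, path in scored]
```

After each retraining round you re-run the ranking with the new detector, so the queue reflects what the latest model finds plausible.
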
The assumption here is that images with objects of interest that are not currently in our train-set distribution will still be ranked higher than images without such objects. (Our monkey detector will probably rank an image containing our missing and coveted gorilla class slightly higher than an empty image*.) After exhausting the images with the most obvious objects, the less obvious ones will “float to the surface”, and we can get a significant fraction of them without looking through all the images we have (which is prohibitively expensive).
As time progresses, the fraction of annotated images that contain an object of interest will go down, and you can decide on a stopping criterion (e.g. stop when there are more than 100 images between each object of interest).
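
A hedged sketch of such a stopping rule, assuming you log, in annotation order, whether each annotated image turned out to contain an object of interest:

```python
def should_stop(annotation_log, max_gap=100):
    """annotation_log: booleans in annotation order, True if the image
    contained at least one object of interest. Stop once the run of
    empty images since the last positive exceeds max_gap."""
    gap = 0
    for has_object in reversed(annotation_log):
        if has_object:
            break
        gap += 1
    return gap > max_gap
```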

First question:
Are there other or complementary strategies to improve recall when a false negative does not have any noticeable effect? (It doesn’t make a car hit a pedestrian; it just makes our animal count incorrect.)

Second question:
When gathering your dataset, it is tempting to use compressed images to reduce transport and storage costs. In theory you want the training images to be as similar as possible to the images you run inference on, so how much do compression artifacts affect object detection networks? I’ve found this hard to quantify properly. On smaller datasets, like CIFAR, this can have a huge effect on final accuracy.
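
One way to quantify it on your own data rather than guess: re-encode an annotated validation set at several JPEG quality levels and measure how detection quality degrades. A minimal sketch using Pillow, where `evaluate_map(images)` is a hypothetical stand-in for your detector’s mAP evaluation:

```python
import io
from PIL import Image

def jpeg_roundtrip(path, quality):
    """Re-encode an image at the given JPEG quality in memory,
    simulating the artifacts the camera's compression would add."""
    buf = io.BytesIO()
    Image.open(path).convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    img = Image.open(buf)
    img.load()  # force decode so the buffer can be released
    return img

def compression_sweep(image_paths, evaluate_map, qualities=(95, 85, 70, 50, 30)):
    """Map JPEG quality -> mAP on the re-encoded validation set,
    so you can see where accuracy starts to fall off."""
    results = {}
    for q in qualities:
        images = [jpeg_roundtrip(p, q) for p in image_paths]
        results[q] = evaluate_map(images)
    return results
```

The quality level where mAP starts dropping tells you how aggressively you can compress before transport savings cost you accuracy.
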

* This might not be true, e.g. in small-object detection where each object of interest covers less than 1% of the pixels in an image. :pleading_face:

Here are some pointers to get you started on relevant research:

A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks

Improving Out-of-Distribution Detection in Machine Learning Models
