How to improve object detection in a noisy video stream?

Hello All,
I hope you are well. I have a specific use case and would like advice on how to solve it with better efficiency and quality.

I know that when we prepare a dataset to detect an object in a video, we usually split the video into separate frames and then label the frames, e.g. with bounding boxes. Then during detection, we split the input video and analyze it frame by frame (maybe there's a better approach - please let me know, regardless of whether it matches the case below or not).

I believe this frame-by-frame approach will be inefficient for me. There are cases when the video is noisy and people can detect objects only when the objects are moving. Just as an example (please don't take it literally): a small bird, far away (that's important) in the forest. There are branches, it's windy so the branches are moving, the camera is not 100% stable, etc. We can tell with ease when the bird starts to fly, but there is no way to see it while it sits still. The object is far away, so its shape in the frame is unclear - it takes up, say, 5x10 pixels in an 800x600 frame - and can easily be confused with a branch. It is not a simple motion-detection case, because branches (and clouds, and shadows) are moving too.

So, how can detection be improved here? I thought about checking a sequence of 3 or 5 frames the way we check words in a sentence (not one by one but as a set), but I'm afraid it will be too compute-intensive.
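For concreteness, here's roughly what I mean by treating a short window of frames as one input: stack them into a single multi-channel array so a model sees motion context at once. This is only a toy sketch with numpy; the frame sizes and window length are hypothetical.

```python
import numpy as np

def stack_frames(frames):
    """Stack a short window of grayscale frames (each H x W) into one
    multi-channel array of shape (H, W, N), so a detector can look at
    the whole window instead of a single frame."""
    return np.stack(frames, axis=-1)

# Hypothetical example: a window of three 600x800 grayscale frames
window = [np.zeros((600, 800), dtype=np.uint8) for _ in range(3)]
x = stack_frames(window)
print(x.shape)  # (600, 800, 3)
```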

P.S. In case it matters: the video is not a file, it is a live stream. Also, I'm not interested in 99.99% of the objects that are usually present in ready-to-use base models (furniture, food, road signs, etc.), so most likely I'll have to build the model from scratch. The variety of objects I want to detect is relatively small, and the detection hardware to run the model should be as cheap as possible (please wish me luck and suggest a good course for this!).

Just tossing in a few thoughts; this isn't really my area of expertise.

You really don't have a choice: a video is always represented as a time sequence of individual frames. That's the native data format, so you'll have to deal with it.

It's still a time sequence of frames. What the "live stream" does is force your solution to keep up in real time. Sometimes that is achieved with raw computing power; sometimes it's done by discarding 'n-1' of every 'n' frames to reduce the data-throughput requirements.
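That discard strategy is trivial to sketch - just a generator that keeps every n-th frame of the stream (plain Python, frame objects of any type):

```python
def keep_every_nth(frames, n):
    """Yield only every n-th frame from a stream, discarding the other
    n-1 frames to reduce throughput for real-time processing."""
    for i, frame in enumerate(frames):
        if i % n == 0:
            yield frame

# With n=3, frames 0, 3, 6, ... are kept and the rest are dropped.
kept = list(keep_every_nth(range(10), 3))
print(kept)  # [0, 3, 6, 9]
```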

Motion detection is often done by simply looking at the changes between frames. So you could pre-process the video so that each frame represents only the differences from the previous frame.
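A minimal sketch of that pre-processing step, assuming grayscale frames as numpy arrays (the threshold value here is just an illustrative guess):

```python
import numpy as np

def frame_diff(prev, curr, threshold=25):
    """Absolute per-pixel difference between two consecutive grayscale
    frames, binarized so only 'changed' pixels survive as 255."""
    diff = np.abs(curr.astype(np.int16) - prev.astype(np.int16))
    return (diff > threshold).astype(np.uint8) * 255
```

You'd then feed these difference masks (or the raw differences) to the detector instead of the original frames.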

You'd also need some other process that detects (using a time sequence, perhaps) when a previously moving region of the frame buffer has stopped moving.

You might be able to ignore motion that only happens within a small region for a short duration, as a way to cope with branches and wind.
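One crude way to implement the "short duration" part: keep only pixels that were flagged as moving in every one of the last few binary motion masks. This is a toy numpy sketch, not a tuned solution - the window length is an assumption, and a real version would also filter by region size.

```python
import numpy as np

def persistent_motion(masks, min_frames=3):
    """Keep only pixels flagged as moving in each of the last
    `min_frames` binary motion masks; brief flicker (wind, noise)
    rarely persists in the same place across several frames."""
    acc = np.ones_like(masks[0], dtype=bool)
    for m in masks[-min_frames:]:
        acc &= m.astype(bool)
    return acc
```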

The only course I know of that comes close to this is the YOLO portion of the Deep Learning Specialization's Convolutional Neural Networks course (Course 4). The example used there is detecting vehicles in traffic.


Did you try using YOLO?
It has a nano version of its model, very lightweight, made to work on video streams.

I'm thinking you can train it on these low-quality images; maybe it will then recognize the blurry pictures, because it has been trained on them.