I hope you are well. I have a specific use case and would like advice on how to solve it more efficiently and with better quality.
I know that the usual way to prepare a dataset for detecting an object in video is to split the video into separate frames and then label the frames, e.g. with bounding boxes. Then, during detection, we again split the input video and analyze it frame by frame (maybe there is a better approach; please let me know regardless of whether it matches the case below or not).
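Just so we're talking about the same thing, here is roughly the frame-by-frame pipeline I mean, as a self-contained sketch. The detector is a stubbed-out placeholder (`detect_objects` is hypothetical, not a real library call; a real system would run a trained model here):

```python
import numpy as np

def detect_objects(frame):
    """Placeholder for a single-frame detector (e.g. a trained CNN).
    Here it just thresholds bright pixels so the sketch is runnable."""
    ys, xs = np.where(frame > 200)
    if len(xs) == 0:
        return []
    # One crude bounding box around all bright pixels: (x, y, w, h)
    return [(int(xs.min()), int(ys.min()),
             int(xs.max() - xs.min() + 1), int(ys.max() - ys.min() + 1))]

def run_frame_by_frame(frames):
    """Analyze a stream one frame at a time, each frame independently."""
    return [detect_objects(f) for f in frames]

# Synthetic 800x600 grayscale frames containing a bright 5x10 "object"
frames = [np.zeros((600, 800), dtype=np.uint8) for _ in range(3)]
for f in frames:
    f[300:305, 400:410] = 255

detections = run_frame_by_frame(frames)
print(detections[0])  # [(400, 300, 10, 5)]
```

The key point is that each call sees only one frame, so any temporal information (the motion that makes the object visible to a human) is thrown away.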
I believe this frame-by-frame approach will be inefficient for me. There are cases when the video is noisy and people can detect objects only when the objects are moving. Just as an example (please don't take it literally): a small bird, far away (that's important), in a forest. There are branches, it's windy so the branches are moving, the camera is not 100% stable, etc. We can easily tell when the bird starts to fly, but there is no way to see it while it sits still. The object is far away, so in a single frame its shape is unclear; it occupies something like 5x10 pixels in an 800x600 frame and can easily be confused with a branch. It is not a plain motion-detection case, because the branches (and clouds, and shadows) are moving too.
So, how can detection be improved here? I thought about checking a sequence of 3 or 5 frames the way we check words in a sentence (not one by one but as a set), but I'm afraid that will be too compute-intensive.
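To make the idea concrete: one common way to feed a short window of frames to a detector is to stack N consecutive frames along the channel axis, so that motion shows up as differences between channels. Below is a minimal NumPy-only sketch of the sliding-window buffering, with the window size of 5 as an assumed example; the model itself is omitted, and the temporal standard deviation at the end is just a crude stand-in for what a learned model would pick up on:

```python
from collections import deque
import numpy as np

WINDOW = 5  # frames per input sample (the "3 or 5" from the question)

def stream_windows(frames, window=WINDOW):
    """Yield (H, W, window) stacks over a stream using a sliding buffer,
    so each new frame costs one stack, not a full recompute of the window."""
    buf = deque(maxlen=window)
    for frame in frames:
        buf.append(frame)
        if len(buf) == window:
            yield np.stack(buf, axis=-1)  # channel axis = time steps

# Simulated 800x600 stream where a tiny 5x10 object shifts right each frame
frames = []
for t in range(8):
    f = np.zeros((600, 800), dtype=np.uint8)
    f[300:305, 400 + 2 * t:410 + 2 * t] = 255
    frames.append(f)

stacks = list(stream_windows(frames))
print(len(stacks), stacks[0].shape)  # 4 (600, 800, 5)

# Per-pixel variation across the window is high only where something moved
motion = stacks[0].astype(np.float32).std(axis=-1)
print(motion.max() > 0)  # True
```

Because the buffer slides one frame at a time, the per-frame cost of inference is the same as frame-by-frame detection with a slightly wider input, not N times larger, which is part of why this may be less compute-intensive than it first appears.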
P.S. If it matters: the video is not a file, it is a live stream. Also, I'm not interested in 99.99% of the objects usually present in ready-to-use base models (furniture, food, road signs, etc.), so most likely I'll have to build the model from scratch, because the variety of objects I want to detect is relatively small and the hardware that runs detection should be as cheap as possible (please wish me good luck and suggest a good course for this!).