Perhaps I’m starting to forget what I learned here, or rather the daunting task of putting it to use in the first place. When we learned YOLO for video, did we have to do data augmentation and segmentation for every single frame (which would seem an entirely Promethean task)? After all, even in modern encodings, videos do at least hold ‘key frames’.
I know the course discussed breaking video down into individual frames, but I don’t think it was ever addressed in the context of live encoding, where you have a stream of frames coming in.
In my present challenge I am wondering how this works.
I should ‘know better’ by this point, but I realize I don’t.
YOLO forward propagation runs on single frames, in both training and operational contexts. You need to train on a diverse enough signal that the model achieves acceptable operational accuracy. That might come from substantial augmentation of a few frames, or from using multiple frames. My initial take is that frames far apart in a video are likely more useful in training than sequential frames, since in the latter case objects may not have changed their location, size, and/or aspect ratio enough to materially impact learning. In any case, it is highly unlikely that you would use every frame during training.
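To illustrate that last point, here is a minimal sketch of thinning a video down to widely spaced frames rather than using every frame. It assumes OpenCV is available; the file name and the stride value are just placeholders for your own setup:

```python
import cv2  # OpenCV, used here only to read frames from a video file

def sample_frames(video_path, stride=30, max_frames=200):
    """Keep every `stride`-th frame so training examples are spread out
    in time instead of being near-duplicates of their neighbors."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    index = 0
    while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:              # end of the video
            break
        if index % stride == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames

# e.g. for a 30 fps video, stride=30 keeps roughly one frame per second
training_frames = sample_frames("my_video.mp4", stride=30)
```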
Wrote a little more on this topic elsewhere, adding it here for future thread readers…
The original paper authors called each grid cell plus anchor box a detector. So for a 3x3 grid with 2 anchor boxes, such as the one discussed in the lecture video, you would have 18 detectors. For a 19x19 grid with 5 anchor boxes, such as the one in the programming exercise, it would be 1,805.

Each detector needs plenty of positive and negative examples, that is, training images with objects centered in, or absent from, each detector location. In addition, depending on the application context, you probably want different aspect ratios and orientations, maybe different lighting conditions, examples of the different classes, etc., so you can see why you need a few/several/many thousands of images to train a YOLO model from scratch. These can originate from a single image that has been manipulated, aka augmented, or they can come from multiple frames of a video. The algorithm doesn’t know or care. All that matters is that there are enough examples that robust learning can occur.
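To make the counting concrete, here is a quick sketch in plain Python of how the detector count and the network’s output tensor shape scale with the grid and anchor choices. The class count of 80 is only an assumption (a COCO-style setup), not something fixed by the counting itself:

```python
def detector_count(grid_h, grid_w, num_anchors):
    """Each grid cell paired with each anchor box is one 'detector'."""
    return grid_h * grid_w * num_anchors

print(detector_count(3, 3, 2))     # 18    (lecture example)
print(detector_count(19, 19, 5))   # 1805  (programming exercise)

# The network outputs one prediction vector per detector:
# (grid_h, grid_w, num_anchors, 5 + num_classes)
# where 5 = objectness score + box center x, y + width + height.
num_classes = 80                   # assumption: COCO-style class count
output_shape = (19, 19, 5, 5 + num_classes)
print(output_shape)                # (19, 19, 5, 85)
```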
At runtime, these learned weights are fixed and then applied exactly the same way with one forward propagation for each frame captured from the video. Hope that helps.
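For the runtime side, a rough sketch of that per-frame loop might look like the following. OpenCV handles the capture; `run_yolo_forward` is a hypothetical stand-in for whatever trained model you actually call:

```python
import cv2

def run_yolo_forward(frame):
    """Hypothetical placeholder for one forward pass of your trained
    YOLO model on a single frame; swap in your real model call here."""
    return []  # e.g. a list of (box, score, class) tuples

cap = cv2.VideoCapture(0)          # 0 = default camera, or pass a file path
while True:
    ok, frame = cap.read()
    if not ok:                     # stream ended or camera disconnected
        break
    detections = run_yolo_forward(frame)  # same fixed weights, every frame
    # ...draw boxes or act on the detections here...
cap.release()
```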