Perhaps I’m starting to forget what I learned here, or rather the daunting task of putting it to use in the first place. When we learned YOLO for video, did we have to do data augmentation and segmentation for every single frame (which would seem an entirely Promethean task)? After all, even in modern encodings, videos do at least hold ‘key frames’.
I know the course discussed breaking video down into individual frames, but I don’t think it was ever addressed in the context of live encoding, where you have a stream of frames coming in.
In my present challenge I am wondering how this works.
I should ‘know better’ by this point, but I realize I don’t.
YOLO forward propagation runs on single frames, in both training and operational contexts. You need to train on a diverse enough signal that the model achieves acceptable operational accuracy. That might come from substantial augmentation of a few frames, or from using multiple frames. My initial take is that frames far apart in a video are likely more useful in training than sequential frames, since in the latter case objects may not have changed their location, size, and/or aspect ratio enough to materially impact learning. In any case, it is highly unlikely that you would use every frame during training.
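To illustrate that last point, here is a minimal sketch of thinning a video down to widely spaced frames rather than using every frame. It assumes OpenCV is available; the file name and the stride value are just placeholders for your own setup:

```python
import cv2  # OpenCV, used here only to read frames from a video file

def sample_frames(video_path, stride=30, max_frames=200):
    """Keep every `stride`-th frame so training examples are spread out
    in time instead of being near-duplicates of their neighbors."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    index = 0
    while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:              # end of the video
            break
        if index % stride == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames

# e.g. for a 30 fps video, stride=30 keeps roughly one frame per second
training_frames = sample_frames("my_video.mp4", stride=30)
```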
Wrote a little more on this topic elsewhere, adding it here for future thread readers…
The original paper authors called each grid cell plus anchor box a detector. So for a 3x3 grid with 2 anchor boxes, such as the one discussed in the lecture video, you would have 18 detectors. For a 19x19 grid with 5 anchor boxes, such as the one in the programming exercise, it would be 1,805.

Each detector needs plenty of positive and negative examples, that is, training images with objects centered in, or absent from, each detector location. In addition, depending on the application context, you probably want different aspect ratios and orientations, maybe different lighting conditions, examples of the different classes, etc., so you can see why you need a few/several/many thousands of images to train a YOLO model from scratch. These can originate from a single image that has been manipulated, aka augmented, or they can come from multiple frames of a video. The algorithm doesn’t know or care. All that matters is that there are enough examples that robust learning can occur.
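To make the counting concrete, here is a quick sketch in plain Python of how the detector count and the network’s output tensor shape scale with the grid and anchor choices. The class count of 80 is only an assumption (a COCO-style setup), not something fixed by the counting itself:

```python
def detector_count(grid_h, grid_w, num_anchors):
    """Each grid cell paired with each anchor box is one 'detector'."""
    return grid_h * grid_w * num_anchors

print(detector_count(3, 3, 2))     # 18    (lecture example)
print(detector_count(19, 19, 5))   # 1805  (programming exercise)

# The network outputs one prediction vector per detector:
# (grid_h, grid_w, num_anchors, 5 + num_classes)
# where 5 = objectness score + box center x, y + width + height.
num_classes = 80                   # assumption: COCO-style class count
output_shape = (19, 19, 5, 5 + num_classes)
print(output_shape)                # (19, 19, 5, 85)
```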
At runtime, these learned weights are fixed and then applied exactly the same way with one forward propagation for each frame captured from the video. Hope that helps.
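For the runtime side, a rough sketch of that per-frame loop might look like the following. OpenCV handles the capture; `run_yolo_forward` is a hypothetical stand-in for whatever trained model you actually call:

```python
import cv2

def run_yolo_forward(frame):
    """Hypothetical placeholder for one forward pass of your trained
    YOLO model on a single frame; swap in your real model call here."""
    return []  # e.g. a list of (box, score, class) tuples

cap = cv2.VideoCapture(0)          # 0 = default camera, or pass a file path
while True:
    ok, frame = cap.read()
    if not ok:                     # stream ended or camera disconnected
        break
    detections = run_yolo_forward(frame)  # same fixed weights, every frame
    # ...draw boxes or act on the detections here...
cap.release()
```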