ML Question on extracting features from sequence of images

What model would you recommend for creating one image to feed to a YOLO model, where that input image carries features from a sequence of images captured by a camera?


Hey @VIVEK_Mehrotra,
Welcome to the community. Please help us understand your question a bit more.

From the first part of your statement, I thought you wanted to artificially create data for an object detection model. But the second part says that each generated image should carry features from a sequence of images.

For starters, what kind of features are you referring to here? To the best of my knowledge, there is no established concept of extracting features from a sequence of images to create a new image.


I think we need a bit more information to know how to answer this. Are you talking about training a YOLO model or just using a YOLO model that has already been trained? If the latter, then there is nothing special you have to do other than perhaps resize the input image to the size and format that the model expects. A trained YOLO model takes individual images and returns the identifications of the objects it is able to recognize in each image, in the complex form described in the lectures and the assignment.
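To make the resizing step concrete, here is a minimal sketch of the letterbox arithmetic, assuming a hypothetical 416x416 input size (a common choice for YOLO v2/v3); the function name and frame dimensions are illustrative only:

```python
def letterbox_params(src_w, src_h, dst=416):
    """Compute scale and padding to fit an image into a dst x dst
    square while preserving aspect ratio (letterboxing)."""
    scale = min(dst / src_w, dst / src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_x = (dst - new_w) // 2   # left/right padding
    pad_y = (dst - new_h) // 2   # top/bottom padding
    return scale, new_w, new_h, pad_x, pad_y

# A 1920x1080 camera frame scaled to fit a 416x416 network input:
scale, w, h, px, py = letterbox_params(1920, 1080)
```

A real pipeline would then scale the pixels and pad the borders before feeding the square image to the network.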

Training a YOLO model is (as you would expect) a much more complex task, given that it requires labelled images, and labelling in the case of YOLO is pretty complex, including anchor boxes, bounding boxes (not the same thing), and object classes. There are a number of really great threads here on the forums about YOLO from one of our fellow students who has invested serious time and effort working with YOLO and has shared the knowledge gained. If you want to dig deeper than the lectures and the YOLO assignment, here’s a good thread to start with, and it provides links to some of the others.


Thank you for clarifying my understanding. One input image does not carry all the information; a sequence of images may carry the complete information. I am talking about using the model, not training it.
It looks like you are suggesting that each input image be processed by the model individually, and the per-image outputs then examined and combined outside the ML model itself? Thank you for sharing.

YOLO is designed to solve a different problem. It takes images input one at a time, runs each one through its forward propagation and outputs a set of predictions regarding object localization and classification from that image and that image alone. The YOLO pipeline does not deal with fusing features or information from multiple input images.
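A minimal sketch of that per-frame behavior; `detect` here is a stand-in stub, not a real YOLO call, just to show that no state crosses frame boundaries:

```python
def detect(frame):
    """Stand-in for a single-image YOLO forward pass. Returns
    (class_name, confidence, box) tuples computed from one frame only."""
    # Hypothetical output; a real model would compute this from pixels.
    return [("car", 0.9, (100, 50, 300, 200))]

def process_stream(frames):
    # Each frame is handled independently: no state is carried over,
    # so nothing here fuses information across the sequence.
    return [detect(f) for f in frames]

predictions = process_stream(["frame0", "frame1", "frame2"])
```

Any fusing of information across frames would have to happen in a separate step, outside this loop.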

+1 on what @paulinpaloalto suggests above: the community needs more information about what information you want to derive from multiple images. Object movement, perhaps?

If it is object movement, or tracking, you might find some useful ideas here: OpenCV Object Tracking - PyImageSearch

Looking forward to learning more


License plate for instance

Once in production, will you be comparing the license plate with one stored in a database to, say, grant access?

There is a similar lab in Deep Learning Specialization for detecting faces to grant access.

Not comparing against a database; rather, an ML problem of predicting license numbers on the fly, after training, from an input stream of frames

I was wondering if a model could be applied to the stream first, and then a single image fed to YOLO

Curious about it: by “predicting license numbers on the fly”, do you mean predicting what the next license to pass by a certain place will be, for instance?

For instance, in a shopping center with an entrance… would this system predict the next license plate coming in, or going out?

You can have a video stream, yes, and you can identify the license plates of the vehicles, and, assuming enough resolution, you can read the license plates, all of this with YOLO, and even more efficiently and accurately with YOLO v8. And this would be real-time identification.

Check out the image below - real time identification of objects.

This GitHub is an example of YOLO used to detect license plates.

And this other project reads the plates.

Another option to implement what I think you are trying to do (Read license plates) is by using OpenCV, a very good library for computer vision.


Thank you very much, Juan. This ML community is so helpful…


No, not predicting the next one. Just the current one from a stream

Got it. Then, some of the above links will certainly help you.

It sounds to me like there is no information from the sequence of images. You grab a frame from the video, process it. Grab another frame, and process it independently. You are not trying to track object movement from frame to frame, which is something an autonomous vehicle, for example, needs to do. Do an interweb search on machine learning license plates and you’ll find a bunch of hits.

Check out this YouTube video:

Live and moving tracking of objects. This one was done with YOLO v3

Does it look like what you are looking for?


I’m not convinced that any version of YOLO does tracking, which requires localization, classification, and identification. By that I mean it has to know that the object classified as a dog in frame 1 and the object classified as a dog in frame 2 are the same object. Or be able to distinguish and follow all the individual humans at the subway from frame to frame, not just put a class-labelled box around them. In addition, I am not aware that any version of YOLO can read characters on a license plate. To the best of my knowledge it can only localize and classify. I’m very confident of this for v1, v2, and YOLO 9000, but I will reread the v3 and v4 papers at the earliest opportunity.

Maybe a combination of models?

With YOLO v8 you can certainly identify the license plates in real time, in motion. So the next step is to extract the data from each license plate. If YOLO cannot read the license plates, we can extract them and pass them to OpenCV to read them.
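A toy sketch of that extraction step, assuming the frame is a row-major grid of pixel values and the box coordinates come from the detector; the names and values here are made up for illustration:

```python
def crop_box(frame, box):
    """Crop a detected region from a frame given (x1, y1, x2, y2).
    `frame` is a row-major grid of pixels (list of rows)."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in frame[y1:y2]]

# Toy 4x6 "frame" of pixel values; suppose a plate was detected
# at box (1, 1, 5, 3).
frame = [[r * 10 + c for c in range(6)] for r in range(4)]
plate = crop_box(frame, (1, 1, 5, 3))
# `plate` is the 2x4 region that would be handed to the OCR step.
```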

Since YOLO v8 was released, there have been plenty of posts with sample applications; some include the license plate project linked above.

@ai_curious what do you think?

PS: I’ve been looking at previous versions of YOLO, and even from YOLO v3 there has been the capability to localize, classify, and identify objects in real time, in motion. Here’s one more link, now with YOLO v3, showing how to do it.


I do think it will require an ensemble to read characters on license plates: one model to localize the plate, and another to do OCR on the alphanumeric characters in each plate image region.
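A sketch of that two-stage ensemble, with stand-in stubs for both stages (the function names and returned values are hypothetical, not a real detector or OCR API):

```python
def localize_plates(frame):
    """Stage 1 (stand-in for a YOLO-style detector): return
    (x1, y1, x2, y2) boxes for plate regions found in one frame."""
    return [(1, 1, 5, 3)]

def ocr_plate(frame, box):
    """Stage 2 (stand-in for an OCR model): read the characters
    inside one plate region of the frame."""
    return "ABC123"

def read_plates(frame):
    # The ensemble: localize plate regions, then OCR each region.
    return [ocr_plate(frame, box) for box in localize_plates(frame)]

plates = read_plates(frame=[[0] * 6 for _ in range(4)])
```

Swapping the stubs for real models would not change the shape of the pipeline: detector output feeds OCR input.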

On the linked v3 how-to video, at the 22:50 mark the narrator says ‘running it as frames’, which is what I mention above. Each set of predictions is run independently on a single frame image input. Because the predicted bounding boxes from the sequence of frames appear to move when overlaid on the video, it looks like tracking is happening. But I don’t think it is: it is only localization (and visualization) happening quickly. I think it is the human observer, connecting the bounding box ‘movement’ like drawings on a stack of cards flipped past your eyes, who is doing the ‘tracking’, not YOLO v3.
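For contrast, actual tracking would add an association step on top of the per-frame detections, for example matching boxes across frames by intersection-over-union; a minimal sketch (the box format and the greedy, threshold-free matching are my own assumptions):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Associate a detection in frame t+1 with the most-overlapping
# detection from frame t; that association is the "tracking" part.
prev = [(0, 0, 10, 10)]          # detections from frame t
curr = [(2, 2, 12, 12)]          # detections from frame t+1
match = max(prev, key=lambda p: iou(p, curr[0]))
```

This association logic lives outside YOLO itself, which is exactly the distinction being made above.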

Thank you both for this interesting piece of information. I am thinking detection, followed by segmentation, followed by OCR. Which models to use for the three pieces is still not clear.
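If it helps, one simple way to actually use the sequence of frames, an assumption on my part rather than anything prescribed above, is to fuse the noisy per-frame OCR reads of the same plate by per-position majority vote:

```python
from collections import Counter

def fuse_reads(per_frame_reads):
    """Combine noisy per-frame OCR results for one plate by majority
    vote per character position (assumes aligned, equal-length reads)."""
    fused = []
    for chars in zip(*per_frame_reads):
        fused.append(Counter(chars).most_common(1)[0][0])
    return "".join(fused)

reads = ["ABC123", "A8C123", "ABC128"]  # noisy reads of the same plate
plate = fuse_reads(reads)
```

This is one way the stream carries more information than any single frame: individual misreads get outvoted.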