This course, developed in partnership with Intel, teaches you to build an interactive system for querying video content using multimodal AI. You’ll create a sophisticated question-answering system that processes, understands, and interacts with video.
Language models and AI applications are increasingly able to process images, audio, and video. In this course, you will learn how these models work by implementing a multimodal RAG system: you will use a multimodal embedding model to embed images and their captions in a shared semantic space, build a retrieval system on that space that returns images in response to text prompts, and pass the retrieved images and text to a Large Vision Language Model (LVLM) to generate a response.
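The course builds this pipeline with its own embedding model and retrieval stack; purely as an illustration of the shared-space idea, here is a minimal sketch using CLIP via the Hugging Face `transformers` library. The model name, frame paths, and query below are illustrative assumptions, not the course's actual setup:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# CLIP embeds images and text into one shared semantic space.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Embed a few video frames (hypothetical file paths).
frames = [Image.open(p) for p in ["frame_000.jpg", "frame_001.jpg"]]
image_inputs = processor(images=frames, return_tensors="pt")
with torch.no_grad():
    image_embs = model.get_image_features(**image_inputs)
image_embs = image_embs / image_embs.norm(dim=-1, keepdim=True)

# Embed a text query into the same space.
text_inputs = processor(text=["a person cooking pasta"], return_tensors="pt", padding=True)
with torch.no_grad():
    text_emb = model.get_text_features(**text_inputs)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# Retrieval: cosine similarity ranks frames against the text prompt.
scores = (text_emb @ image_embs.T).squeeze(0)
best = scores.argmax().item()
print(f"Best-matching frame: index {best}, score {scores[best].item():.3f}")
```

In a full multimodal RAG system, the top-ranked frames (and their captions) would then be passed to the LVLM along with the user's question to generate a grounded answer.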
By the end of this course, you’ll have the expertise to create AI systems that can intelligently interact with video content. This skill set opens up possibilities for developing advanced search engines that understand visual context, creating AI assistants capable of discussing video content, and building automated systems for video content analysis and summarization. Whether you’re looking to enhance content management systems, improve accessibility features, or push the boundaries of human-AI interaction, the techniques learned in this course will provide a solid foundation for innovation in multimodal AI applications.
In this course, you will make API calls to access multimodal models hosted by Prediction Guard on Intel’s cloud.
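To give a sense of what such an API call can look like, here is a hedged sketch of an OpenAI-style chat request to a hosted multimodal model. The endpoint path, model identifier, and payload shape below are assumptions modeled on common chat-completion APIs, not Prediction Guard's documented interface; the course notebooks show the exact client usage:

```python
import os
import requests

# Assumed endpoint and payload shape, modeled on OpenAI-style chat APIs.
# See the course notebooks for Prediction Guard's actual client and models.
API_URL = "https://api.predictionguard.com/chat/completions"  # assumed
headers = {"Authorization": f"Bearer {os.environ['PREDICTIONGUARD_API_KEY']}"}

payload = {
    "model": "llava-1.5-7b-hf",  # assumed model identifier
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is happening in this video frame?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/frame.jpg"}},
            ],
        }
    ],
}

response = requests.post(API_URL, headers=headers, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```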