Using RAG to teach an (image + text) scanned PDF doc to a model

I have a PDF document that is simply a scanned version of a book containing both text and images. I would like to use RAG with an open-source model to build a chatbot that can take in any modality (text/image/audio/video) as input and produce text/image output grounded in the document's content.
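To make the question more concrete, here is a rough sketch of the kind of pipeline I imagine: OCR the scanned pages, embed the OCR'd text and the page images into a shared vector space, and retrieve the most relevant pages as context for an open-source model. The file name, the query, and the tool choices (Tesseract for OCR, CLIP via sentence-transformers for embeddings) are just placeholder assumptions on my part, not anything from the course.

```python
# Rough sketch: OCR each scanned page, embed text and page images with CLIP,
# then retrieve the best-matching page for a query. "book.pdf" and the query
# below are placeholders.
from pdf2image import convert_from_path            # renders scanned PDF pages to PIL images
from sentence_transformers import SentenceTransformer, util
import pytesseract                                 # OCR engine for the scanned text

model = SentenceTransformer("clip-ViT-B-32")       # embeds text and images into one space

pages = convert_from_path("book.pdf")              # placeholder path to the scanned book
texts = [pytesseract.image_to_string(p) for p in pages]  # OCR each page image

# NOTE: CLIP truncates long text, so real OCR output would need chunking first;
# this embeds whole pages only to keep the sketch short.
text_embs = model.encode(texts, convert_to_tensor=True)
image_embs = model.encode(pages, convert_to_tensor=True)  # page images live in the same space

query = "What does chapter 3 say about the main topic?"  # placeholder question
q_emb = model.encode(query, convert_to_tensor=True)

# Retrieve the most relevant page; its text (and page image) would then be
# handed to an open-source LLM/VLM as context for generation.
best = int(util.cos_sim(q_emb, text_embs).argmax())
print(f"Most relevant page: {best + 1}")
print(texts[best][:300])
```

Audio/video input would presumably need an extra step (transcription, frame extraction) before embedding, which is part of what I'm unsure about.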

Is this something doable after finishing the “Building Multimodal Search and RAG” course?