Optimal Document Text and Image Processing

Hello, in a recent client engagement we were building a RAG implementation that uses product installation and maintenance documents as input, to help field service technicians resolve installation and maintenance issues before calling a human “expert” for help. The images in the documents ranged from low to high complexity and included tables, flowcharts, photos, and CAD drawings. For the PoC we decided to exclude image processing and focused only on chunking the text and storing the embeddings in a vector DB.

My question: what approaches have you found effective for processing low-, medium-, and high-complexity images so that, in a semantic search, the results are passed to the LLM and document and page number references to both text and image hits are displayed to the user after the completion? Thank you for your help!
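
To make the goal concrete, here is a minimal sketch of the kind of record I have in mind. It is not what we built. It assumes every indexed chunk, whether text or an image description, carries its source document and page number so that references can be listed after the completion. All names (`Chunk`, `format_references`, the sample file name) are hypothetical, and embedding and retrieval are left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc: str      # source document name (hypothetical)
    page: int     # 1-based page number in that document
    kind: str     # "text" or "image"
    content: str  # chunk text, or a caption/description for an image


def format_references(hits):
    """Return deduplicated 'doc, p. N (kind)' strings, keeping hit order."""
    seen = set()
    refs = []
    for h in hits:
        key = (h.doc, h.page, h.kind)
        if key not in seen:
            seen.add(key)
            refs.append(f"{h.doc}, p. {h.page} ({h.kind})")
    return refs


# Hypothetical hits, standing in for what a vector search might return.
hits = [
    Chunk("install_guide.pdf", 12, "text", "Tighten the flange bolts..."),
    Chunk("install_guide.pdf", 12, "image", "Flowchart: bolt torque sequence"),
    Chunk("install_guide.pdf", 12, "text", "Torque values are listed..."),
]
print("\n".join(format_references(hits)))
```

The open part of my question is how to produce a useful `content` value for the image records at each complexity level.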