Hello everyone. I need help with a research project. My colleague and I are trying to take both text and a related image as input (a question from a question paper), and the output has to be the level of the question: remembering, application, comparison, etc. My colleague says we should process the image and the text separately, fuse the resulting features, and then feed them to a BiGRU network. I'm thinking we might need an LLM instead, because it would make for a more general application and might give us features directly from both the text and the image. What are your thoughts? We will be using Python.
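
To make the first option concrete, here is a rough sketch of how I understand my colleague's pipeline, assuming PyTorch. The encoders are left out: random tensors stand in for the text and image features. The class name `FusionBiGRU`, the feature sizes (768 and 512), the three-label `LEVELS` list, and the choice to fuse by prepending the projected image vector as one extra "token" are all assumptions made for illustration, not something we have settled on.

```python
import torch
import torch.nn as nn

# Hypothetical label set: only the levels named above, not a full taxonomy.
LEVELS = ["remembering", "application", "comparison"]


class FusionBiGRU(nn.Module):
    """Fuses per-token text features with one image vector, then runs a BiGRU."""

    def __init__(self, text_dim=768, image_dim=512, hidden=128,
                 num_levels=len(LEVELS)):
        super().__init__()
        # Project the image vector into the text feature space so the two
        # modalities can share one sequence (assumed fusion strategy).
        self.image_proj = nn.Linear(image_dim, text_dim)
        self.gru = nn.GRU(text_dim, hidden, batch_first=True,
                          bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_levels)

    def forward(self, text_feats, image_feats):
        # text_feats: (batch, seq_len, text_dim)
        # image_feats: (batch, image_dim)
        image_token = self.image_proj(image_feats).unsqueeze(1)
        # Fused sequence: (batch, seq_len + 1, text_dim)
        fused = torch.cat([image_token, text_feats], dim=1)
        # h_n: (2, batch, hidden) -> final forward and backward states
        _, h_n = self.gru(fused)
        summary = torch.cat([h_n[0], h_n[1]], dim=-1)
        return self.classifier(summary)  # (batch, num_levels) logits


# Placeholder features standing in for real text/image encoder outputs.
model = FusionBiGRU()
text_feats = torch.randn(4, 20, 768)   # 4 questions, 20 tokens each
image_feats = torch.randn(4, 512)      # one vector per question image
logits = model(text_feats, image_feats)
predicted = [LEVELS[i] for i in logits.argmax(dim=-1).tolist()]
```

The model is untrained here, so `predicted` is meaningless; the point is only to show the shapes and where the fusion happens. In the LLM-based alternative, a single multimodal model would replace both encoders and possibly the BiGRU as well.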