Determining the level of a question in a question paper

Hello everyone. I need help with a research project. My colleague and I are trying to take both text and a related image as input (a question from a question paper), and the output has to be the level of the question: remembering, application, comparison, etc. My colleague says we should process the image and the text separately, fuse the resulting features, and then pass them to a BiGRU network. I think we might need an LLM instead, because it would make for a more general application and might give us the features directly from both the text and the image. What are your thoughts? We will be using Python.
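
To make the fuse-then-BiGRU idea concrete, here is a rough, dependency-free sketch of the data flow only. Several things in it are assumptions rather than anything we have settled: fusion is done by concatenating the image features onto every token vector, the GRU cell is hand-written without biases and uses untrained random weights, the hard-coded `tokens` and `image` vectors are hypothetical stand-ins for whatever a text encoder and an image encoder would produce, and the label list only contains the three levels named above. In a real project a deep learning framework would replace the hand-written cell.

```python
import math
import random

# Labels taken from the post; the complete label set is an assumption.
LEVELS = ["Remembering", "Application", "Comparison"]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class GRUCell:
    """Minimal GRU cell (no biases) with untrained random weights."""

    def __init__(self, input_size, hidden_size, rng):
        # One input matrix W and one recurrent matrix U per gate:
        # z = update gate, r = reset gate, n = candidate state.
        self.W = {g: rand_matrix(hidden_size, input_size, rng) for g in "zrn"}
        self.U = {g: rand_matrix(hidden_size, hidden_size, rng) for g in "zrn"}

    def _gate(self, name, x, h, activation):
        wx = matvec(self.W[name], x)
        uh = matvec(self.U[name], h)
        return [activation(a + b) for a, b in zip(wx, uh)]

    def step(self, x, h):
        z = self._gate("z", x, h, sigmoid)
        r = self._gate("r", x, h, sigmoid)
        reset_h = [ri * hi for ri, hi in zip(r, h)]
        n = self._gate("n", x, reset_h, math.tanh)
        # Blend the previous state and the candidate using the update gate.
        return [zi * hi + (1.0 - zi) * ni for zi, hi, ni in zip(z, h, n)]


def run_gru(cell, sequence, hidden_size):
    h = [0.0] * hidden_size
    for x in sequence:
        h = cell.step(x, h)
    return h


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(token_vectors, image_vector, hidden_size=8, seed=0):
    rng = random.Random(seed)
    # Fusion by concatenation: append the image features to every token vector.
    fused = [tok + image_vector for tok in token_vectors]
    input_size = len(fused[0])
    forward = GRUCell(input_size, hidden_size, rng)
    backward = GRUCell(input_size, hidden_size, rng)
    # Bidirectional pass: read the sequence left-to-right and right-to-left.
    h_fwd = run_gru(forward, fused, hidden_size)
    h_bwd = run_gru(backward, fused[::-1], hidden_size)
    # Linear layer over both final states, then softmax over the levels.
    out = rand_matrix(len(LEVELS), 2 * hidden_size, rng)
    return softmax(matvec(out, h_fwd + h_bwd))


# Hypothetical stand-ins for the outputs of a text encoder and an image encoder.
tokens = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2], [0.5, 0.1, 0.1]]
image = [0.7, -0.3]
probs = classify(tokens, image)
print(LEVELS[probs.index(max(probs))], probs)
```

Since the weights are random, the predicted level here is meaningless; the point is just the shape of the pipeline (separate features, fusion, BiGRU, classifier). The LLM route would replace everything before the classifier with a single multimodal model.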