Guidance Needed on Integrating VQA into a Home Object Detection Project

Hi everyone,

I’m working on a project that uses a live video feed for home object detection. The system captures images of detected objects and organizes them into specific folders, making it easy to locate items when needed.

My next goal is to add a Visual Question Answering (VQA) component. For example, when I ask “Where are my keys?”, the system should give a context-aware answer (e.g., “Your keys are on the table in front of the water bottle”). I’m exploring different approaches, such as combining Graph Neural Networks with CNNs or using vision transformers (ViT), but I’m not sure which approach is better for this project or how to tackle the problem.
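To make the question concrete, here is a minimal sketch of the simplest baseline I can imagine: answering “Where is my X?” directly from the detector’s output (labels plus bounding boxes) using hand-coded spatial rules, with no learned VQA model at all. The detection format and function names below are my own assumptions, not a real API:

```python
# Hypothetical baseline: answer "Where is my X?" from detector output
# (label + bounding box per object) with hand-coded spatial rules.
# The detection dict format here is an assumption, not a real library API.

def center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def spatial_relation(target_box, ref_box):
    """Describe the target relative to a reference object, in image coordinates.

    Heuristic only: larger y means lower in the frame, which for a
    typical tabletop camera roughly corresponds to "in front of".
    """
    tx, ty = center(target_box)
    rx, ry = center(ref_box)
    if abs(tx - rx) > abs(ty - ry):
        return "left of" if tx < rx else "right of"
    return "in front of" if ty > ry else "behind"

def answer_where(query_label, detections):
    targets = [d for d in detections if d["label"] == query_label]
    if not targets:
        return f"I can't see any {query_label} right now."
    target = targets[0]
    others = [d for d in detections if d is not target]
    if not others:
        return f"Your {query_label} are in view."
    # Use the nearest other detected object as the spatial anchor.
    tx, ty = center(target["box"])
    ref = min(
        others,
        key=lambda d: (center(d["box"])[0] - tx) ** 2
                      + (center(d["box"])[1] - ty) ** 2,
    )
    rel = spatial_relation(target["box"], ref["box"])
    return f"Your {query_label} are {rel} the {ref['label']}."

# Example frame with made-up boxes (x1, y1, x2, y2):
detections = [
    {"label": "keys", "box": (100, 200, 140, 230)},
    {"label": "water bottle", "box": (150, 180, 190, 260)},
    {"label": "table", "box": (0, 150, 400, 300)},
]
print(answer_where("keys", detections))
# → Your keys are left of the water bottle.
```

My uncertainty is essentially whether rule-based spatial reasoning like this is good enough, or whether I should move to a learned model (GNN over a scene graph, ViT-based VQA, or a pretrained vision-language model) for more robust answers.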

Could anyone share insights or recommend techniques for successfully implementing VQA in this context?

Thank you in advance for your guidance!
Best regards,
Karthik