Multimodal audiovisual saliency prediction

Hello, I want to conduct research on multimodal audiovisual saliency prediction. I would appreciate any help with this.
Could you please suggest some ways to approach developing it, and any resources to get started with?