I loved the course. Any advice on exploring multimodal approaches?

I am working on an interesting project that requires embedding highly specialized material on musical pedagogy and historical analysis.
Essentially, I am aiming for a system that returns quotes from the various sources in which the prompted concepts are discussed, keeping each mention separate and citing its source correctly.

One additional feature might be to return images of score fragments, or to accept a sung melody as input (some of the sources include brief musical segments in machine-readable format).

I realize this is a somewhat vague question, but could anyone suggest further resources on multimodal approaches involving Weaviate?