Understanding Challenges in Multimodal LLM Development

Hi all! I work in the AI evaluation space, building developer tools for tracking performance improvements in LLMs. One thing we want to do is build tools that help developers with multimodal LLM (MLLM) development.

I’d love to hear: what specific issues do you face in building MLLM-based applications today? Is it developing the right benchmarks, having the right infra for iterating quickly, collecting feedback, or something else?