Apple’s multimodal models focus on data curation

Apple introduced MM1.5, a new family of multimodal large language models designed to improve text-rich image understanding, visual referring and grounding, and multi-image reasoning. The models, ranging from 1 billion to 30 billion parameters, include dense and mixture-of-experts variants and perform strongly even at the smaller scales. Apple's approach emphasizes careful data curation and training strategies, offering insights that could guide future work on multimodal large language models. (arXiv)