Researchers from the University of Washington and the Allen Institute for AI have developed MolmoAct, a family of open-source robotic foundation models that integrate perception, planning, and control through structured reasoning. The models generate three types of tokens in sequence: depth perception tokens for 3D understanding, visual reasoning traces that show planned trajectories, and action tokens for robot control. MolmoAct-7B-D achieved 70.5 percent zero-shot accuracy on SimplerEnv Visual Matching tasks, surpassing closed-source models π0 and GR00T N1 while requiring much less pre-training time, and 86.6 percent average success on LIBERO benchmarks. Because the planned trajectories are exposed as visual traces, the approach addresses some limitations of current vision-language-action models: robot decision-making becomes more explainable, and users can steer it by editing the trajectories. The team released all model weights, training code, and the MolmoAct Dataset, which contains over 10,000 robot trajectories. (arXiv)
