MoTVLA: A Vision-Language-Action Model with Unified Fast-Slow Reasoning
Keywords: Motion Decomposition, VLA, Fast-Slow Reasoning
TL;DR: This work presents MoTVLA, a vision-language-action model that integrates high-level fast-slow reasoning and motion decomposition with low-level action learning, enabling interpretable and language-steered behavior policy learning.
Abstract: Integrating vision-language instructions into visuomotor policies is gaining momentum in robot learning as a way to enhance open-world generalization. Despite promising advances, existing approaches face one of two challenges: limited language steerability when no generated reasoning is used as a condition, or significant inference latency when reasoning is incorporated. In this work, we introduce MoTVLA, a mixture-of-transformers (MoT)-based vision-language-action (VLA) model that integrates fast-slow unified reasoning with behavior policy learning. MoTVLA preserves the general intelligence of the VLM (serving as the generalist) for tasks such as perception, scene understanding, and semantic planning, while incorporating a domain expert, a second transformer that shares knowledge with the generalist, to generate fast domain-specific reasoning (e.g., robot motion decomposition), thereby improving policy execution efficiency. By conditioning the action expert on decomposed motion instructions, MoTVLA can learn diverse behaviors and substantially improve language steerability. Extensive evaluations across natural language processing benchmarks, robotic simulation environments, and real-world experiments confirm the superiority of MoTVLA in both language reasoning and manipulation task performance. Demonstration videos and corresponding descriptions are available on the project page (https://motvla.github.io/MoTVLA-website/).
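Illustrative sketch (not the authors' implementation): the abstract describes a generalist for semantic planning, a domain expert for fast motion decomposition, and an action expert conditioned on the decomposed motions. The control flow this implies might look roughly like the toy Python below. Every name here (`Observation`, `generalist_plan`, `domain_expert_decompose`, `action_expert`, `run_policy`) is hypothetical, the string-based stubs stand in for learned transformer modules, and the choice to call the generalist only on request is an assumption about how the fast and slow paths might be split.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Observation:
    scene: str        # stand-in for the visual input (hypothetical)
    instruction: str  # natural-language task instruction


def generalist_plan(obs: Observation) -> str:
    """Slow path (assumed): stub for the VLM generalist's semantic planning."""
    return f"plan '{obs.instruction}' in scene '{obs.scene}'"


def domain_expert_decompose(obs: Observation) -> List[str]:
    """Fast path (assumed): stub for motion decomposition by the domain expert."""
    verb, _, target = obs.instruction.partition(" ")
    return [f"reach toward {target}", f"{verb} {target}", "retract arm"]


def action_expert(obs: Observation, motion: str) -> List[float]:
    """Stub action head conditioned on one decomposed motion instruction.

    Returns a placeholder vector derived from the motion text, not a real action.
    """
    return [float(len(motion)), float(len(motion.split()))]


def run_policy(
    obs: Observation, use_slow_reasoning: bool = False
) -> Tuple[Optional[str], List[Tuple[str, List[float]]]]:
    # The generalist is queried only when slow reasoning is requested (assumption).
    plan = generalist_plan(obs) if use_slow_reasoning else None
    # The action expert is conditioned on each decomposed motion in turn.
    steps = [(m, action_expert(obs, m)) for m in domain_expert_decompose(obs)]
    return plan, steps


if __name__ == "__main__":
    plan, steps = run_policy(Observation("a red cup on a table", "grasp cup"))
    for motion, action in steps:
        print(motion, action)
```

In this sketch, changing the instruction changes the decomposed motions and therefore the conditioning seen by the action stub, which is the sense in which such a design could be language-steerable.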
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 188