MoTVLA: A Vision-Language-Action Model with Unified Fast-Slow Reasoning

17 Sept 2025 (modified: 11 Feb 2026), Submitted to ICLR 2026, CC BY 4.0
Keywords: vision-language-action models, diffusion policy, mixture-of-transformers, unified fast-slow reasoning
TL;DR: This work presents MoTVLA, a vision-language-action model that integrates high-level fast-slow unified reasoning with low-level control, achieving interpretable and language-steered behavior policy learning.
Abstract: Integrating vision-language instructions into visuomotor policies is gaining momentum in robot learning as a way to enhance open-world generalization. Despite promising advances, existing approaches face one of two challenges: limited language steerability when no generated reasoning is used as a condition, or significant inference latency when reasoning is incorporated. In this work, we introduce MoTVLA, a mixture-of-transformers (MoT)–based vision–language–action (VLA) model that integrates fast–slow unified reasoning with behavior policy learning. MoTVLA preserves the general intelligence of pretrained VLMs (serving as the generalist) for tasks such as perception, scene understanding, and semantic planning, while incorporating a domain expert, a second transformer that shares knowledge with the pretrained VLM, to generate fast domain-specific reasoning (e.g., robot motion decomposition), thereby improving policy execution efficiency. By conditioning the action expert on decomposed motion instructions, MoTVLA can learn diverse behaviors and substantially improve language steerability. Extensive evaluations across natural language processing benchmarks, robotic simulation environments, and real-world experiments confirm the superiority of MoTVLA in both language reasoning and manipulation task performance. Demonstration videos and corresponding descriptions are available at https://motvla.github.io/MoTVLA-website/.
Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Submission Number: 8534