MoTVLA: A Vision-Language-Action Model with Unified Fast-Slow Reasoning

Wenhui Huang; Changhe Chen; Han Qi; Chen Lv; Yilun Du; Heng Yang

17 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: vision-language-action models, diffusion policy, mixture-of-transformers, unified fast-slow reasoning

TL;DR: This work presents MoTVLA, a vision-language-action model that integrates high-level fast-slow unified reasoning with low-level control, achieving interpretable and language-steered behavior policy learning.

Abstract: Integrating vision-language instructions into visuomotor policies is gaining momentum in robot learning as a way to improve open-world generalization. Despite promising advances, existing approaches face one of two challenges: limited language steerability when the policy is not conditioned on generated reasoning, or significant inference latency when reasoning is incorporated. In this work, we introduce MoTVLA, a mixture-of-transformers (MoT)-based vision-language-action (VLA) model that integrates unified fast-slow reasoning with behavior policy learning. MoTVLA preserves the general intelligence of a pre-trained VLM (serving as the generalist) for tasks such as perception, scene understanding, and semantic planning. It also incorporates a domain expert, a second transformer that shares knowledge with the pre-trained VLM, to generate fast domain-specific reasoning (e.g., robot motion decomposition), thereby improving policy execution efficiency. By conditioning the action expert on the decomposed motion instructions, MoTVLA can learn diverse behaviors and substantially improve language steerability. Extensive evaluations across natural language processing benchmarks, robotic simulation environments, and real-world experiments confirm the superiority of MoTVLA in both language reasoning and manipulation task performance. Demonstration videos and corresponding descriptions are available at https://motvla.github.io/MoTVLA-website/.

Supplementary Material: zip

Primary Area: applications to robotics, autonomy, planning

Submission Number: 8534
