DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models

ICLR 2026 Conference Submission 22630 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Embodied Reasoning, Vision-Language-Action Models, Reinforcement Learning, Robot Learning
TL;DR: We introduce DeepThinkVLA, a VLA that couples a hybrid-attention decoder (autoregressive CoT reasoning, parallel action decoding) with an SFT-to-RL training pipeline, aligning reasoning with control via outcome-based rewards.
Abstract: Enabling Vision-Language-Action (VLA) models to "think before acting" via Chain-of-Thought (CoT) is a promising path to overcoming the data-hungry nature of end-to-end robot policies. However, progress is stalled by a fundamental conflict: existing models use a single autoregressive decoder for both sequential CoT reasoning and high-dimensional, parallelizable robot actions. This architectural mismatch degrades motor control and fails to forge a strong causal link between thought and action. We introduce DeepThinkVLA, which resolves this conflict through a tightly integrated architecture and training strategy. Architecturally, our hybrid-attention decoder generates sequential CoT with causal attention and then switches to bidirectional attention for fast, parallel decoding of action vectors. This design is complemented by a two-stage training pipeline: we first use Supervised Fine-Tuning (SFT) to teach the model foundational reasoning, then apply Reinforcement Learning (RL) with task-success rewards to causally align the full reasoning-action sequence with desired outcomes. This synergy leads to state-of-the-art performance, achieving a 97.0% success rate on the LIBERO benchmark. Our ablations confirm the design's effectiveness: the hybrid architecture alone outperforms standard decoders by 15.5%, and the final RL stage provides a crucial 2% boost to secure top performance.
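The core architectural idea, as described in the abstract, is an attention mask that is causal over the CoT tokens but bidirectional among the action tokens so they can be decoded in parallel. The following is a minimal illustrative sketch of such a mask, not the authors' actual implementation; the function name `hybrid_attention_mask` and the token layout (CoT tokens followed by action tokens) are assumptions for clarity.

```python
# Minimal sketch (assumed layout: CoT tokens first, action tokens last) of a
# hybrid attention mask: causal attention over CoT, bidirectional attention
# among action tokens, with action tokens also seeing the full CoT prefix.
import torch


def hybrid_attention_mask(num_cot: int, num_action: int) -> torch.Tensor:
    """Return a boolean mask of shape (L, L); True means the query may attend."""
    total = num_cot + num_action
    # Standard causal (lower-triangular) mask for the whole sequence.
    mask = torch.tril(torch.ones(total, total, dtype=torch.bool))
    # Let action tokens attend bidirectionally to one another,
    # enabling parallel decoding of the action vector.
    mask[num_cot:, num_cot:] = True
    return mask


if __name__ == "__main__":
    # 4 CoT tokens followed by 3 action tokens.
    print(hybrid_attention_mask(num_cot=4, num_action=3).int())
```

In this sketch, the top-left block stays strictly causal (sequential reasoning), while the bottom-right block is fully connected, which is what allows the action chunk to be emitted in a single parallel step rather than token by token.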
Primary Area: applications to robotics, autonomy, planning
Submission Number: 22630