Hybrid Thinking in Vision-Language-Action Models

Published: 01 Feb 2026, Last Modified: 01 Feb 2026 · CoRL 2025 Workshop LEAP (Early-bird) · CC BY 4.0
Keywords: vision-language-action models; chain-of-thought; robotic manipulation
TL;DR: We propose the Hybrid Thinking framework, which improves VLA performance without sacrificing latency and enables the agent to use different inference modalities. Hybrid Thinking VLAs achieve both data-efficient performance and intervenability.
Abstract: Using Large Language Models to produce intermediate thoughts before providing a final answer has been a successful recipe for solving increasingly complex tasks with reduced human supervision. In robotics, similar embodied reasoning strategies have also been shown to improve performance. However, because these techniques lengthen the model's outputs to include reasoning traces, inference time suffers. Delaying an agent's actions in real-world executions, as in robotic manipulation settings, can be particularly problematic, since the agent must perform long sequences of actions to solve a task. In this work, we establish a Hybrid Thinking (HyT) framework for training Vision-Language-Action (VLA) models. Agents learn both to answer directly with actions (fast mode) and to spend more time thinking (slow mode). We show that, even when generating no thoughts in fast mode, the agent's performance benefits from training on the reasoning traces that lead to successful actions. Our agent demonstrates improved performance at lower inference cost and greater scalability with larger datasets across a range of robotic manipulation tasks. Additionally, hybrid thinking allows humans to interpret the agent's intentions and intervene to prevent failures when executing complex tasks.
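The fast/slow dispatch described in the abstract can be sketched as a single inference loop that conditions the policy on a mode token. This is an illustrative toy only: the `<think>`/`<act>` control tokens, the `ToyVLA` stub, and the output format are assumptions for demonstration, not the paper's actual prompt format or decoding interface.

```python
class ToyVLA:
    """Hypothetical stand-in for a VLA policy: maps a prompt to output tokens."""

    def generate(self, prompt: str) -> str:
        if prompt.startswith("<think>"):
            # Slow mode: emit a reasoning trace before the action.
            return "thought: grasp the cup handle | action: move_to(0.3, 0.1)"
        # Fast mode: emit the action directly, with no intermediate thoughts.
        return "action: move_to(0.3, 0.1)"


def hybrid_step(model: ToyVLA, observation: str, slow: bool) -> str:
    """Dispatch between fast (action-only) and slow (reason-then-act) modes."""
    mode_token = "<think>" if slow else "<act>"
    output = model.generate(f"{mode_token} {observation}")
    # The action is always the final segment, so both modes share one parser.
    # In slow mode, the discarded "thought:" prefix is what a human overseer
    # would inspect to interpret intentions and intervene before execution.
    return output.split("action:")[-1].strip()


model = ToyVLA()
fast_action = hybrid_step(model, "cup on table", slow=False)
slow_action = hybrid_step(model, "cup on table", slow=True)
```

Both modes yield the same executable action here; the difference is only whether a reasoning trace is produced (and could be inspected) along the way, which is the latency/intervenability trade-off the framework targets.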
Submission Number: 6