Abstract: In this paper, we present DiffusionVLA (DiVLA), a novel framework that integrates autoregressive reasoning with diffusion policies to address the limitations of existing methods: autoregressive Vision-Language-Action (VLA) models lack precise and robust action generation, while diffusion-based policies inherently lack reasoning capabilities. Central to our approach is autoregressive reasoning — a task decomposition and explanation process enabled by a pre-trained VLM — to guide diffusion-based action policies. To tightly couple reasoning with action generation, we introduce a reasoning injection module that directly embeds self-generated reasoning phrases into the policy learning process. The framework is simple, flexible, and efficient, enabling seamless deployment across diverse robotic platforms.
We conduct extensive experiments on multiple real robots to validate the effectiveness of DiVLA. Our tests include a challenging factory sorting task, where DiVLA successfully categorizes objects, including those not seen during training. The reasoning injection module enhances interpretability, enabling explicit failure diagnosis by visualizing the model's decision process. Additionally, we test DiVLA on a zero-shot bin-picking task, achieving 63.7% accuracy on 102 previously unseen objects. Our method demonstrates robustness to visual changes, such as distractors and new backgrounds, and easily adapts to new embodiments. Furthermore, DiVLA can follow novel instructions and retains conversational ability. Notably, DiVLA is data-efficient and fast at inference; our smallest model, DiVLA-2B, runs at 82 Hz on a single A6000 GPU. Finally, we scale the model from 2B to 72B parameters, showing improved generalization with increased model size.
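To make the reasoning injection idea concrete, the sketch below illustrates one plausible realization: the VLM's self-generated reasoning phrase is pooled into an embedding that modulates hidden features of the diffusion action head via feature-wise (FiLM-style) conditioning. The module name, dimensions, and conditioning mechanism are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch (assumptions, not the authors' implementation): injecting a
# pooled reasoning-phrase embedding into a diffusion policy's hidden features.
import torch
import torch.nn as nn

class ReasoningInjection(nn.Module):
    """Hypothetical module: maps a pooled reasoning embedding to scale/shift
    parameters that modulate the policy's hidden features (FiLM-style)."""
    def __init__(self, reasoning_dim: int, hidden_dim: int):
        super().__init__()
        self.to_scale_shift = nn.Linear(reasoning_dim, 2 * hidden_dim)

    def forward(self, policy_hidden: torch.Tensor, reasoning_emb: torch.Tensor) -> torch.Tensor:
        # policy_hidden: (batch, hidden_dim) features inside the diffusion action head
        # reasoning_emb: (batch, reasoning_dim) pooled embedding of the VLM's reasoning phrase
        scale, shift = self.to_scale_shift(reasoning_emb).chunk(2, dim=-1)
        return policy_hidden * (1 + scale) + shift

# Usage sketch with made-up dimensions
inject = ReasoningInjection(reasoning_dim=1024, hidden_dim=256)
policy_hidden = torch.randn(4, 256)    # features from the diffusion action head
reasoning_emb = torch.randn(4, 1024)   # pooled VLM reasoning-token embedding
conditioned = inject(policy_hidden, reasoning_emb)
print(conditioned.shape)  # torch.Size([4, 256])
```

Any conditioning mechanism that lets the reasoning tokens directly shape the action denoising process would serve the same role; FiLM-style modulation is shown here only because it is a common, lightweight choice.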
Lay Summary: In the world of robotics, teaching robots to perform complex tasks with vision, language, and actions is a significant challenge. Existing methods either struggle to generate precise actions or lack the ability to reason through tasks effectively. To overcome this, we developed DiVLA, a new VLA framework that combines reasoning with action generation. DiVLA leverages a reasoning module that helps the robot break down tasks and make better decisions, even in unfamiliar situations.
Our approach allows robots to understand and execute tasks more efficiently, improving their ability to work in real-world environments. For example, we tested DiVLA on a sorting task and achieved strong results, including accurately handling new objects it had never seen before. Moreover, DiVLA can visualize and explain its decision-making process, making it easier to understand and troubleshoot when things go wrong.
This work has the potential to make robots more adaptable, interpretable, and capable of handling a variety of tasks with minimal training, a major step toward smarter and more versatile robotic systems.
Primary Area: Applications->Robotics
Keywords: vision-language-action models, reasoning
Submission Number: 7328