dVLA: Diffusion Vision-Language-Action Model with Multimodal Chain-of-Thought

ICLR 2026 Conference Submission 20027 Authors

19 Sept 2025 (modified: 08 Oct 2025)
License: CC BY 4.0
Keywords: Vision-Language-Action Model, Chain-of-Thought
Abstract: Vision-language-action (VLA) models have emerged as a next-generation framework in robotics, integrating visual perception, language reasoning, and robotic control into a unified system. In this paper, we present dVLA, a diffusion vision-language-action model with a multimodal chain-of-thought. dVLA jointly optimizes visual reasoning, language comprehension, and robotic actions through a single unified diffusion-based objective. By harmonizing these modalities within one cohesive framework, dVLA enables more effective cross-modal reasoning, allowing the model to generalize to novel instructions and objects. To ensure practical viability, we also integrate model-acceleration methods that substantially reduce robot response times. Extensive evaluations in both simulation and the real world confirm that dVLA significantly outperforms current discrete and continuous VLA models, highlighting the potential of diffusion language model (DLM)-based frameworks for robotics.
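For readers unfamiliar with DLM-style training, the sketch below illustrates one plausible form of such a unified diffusion objective: a masked-diffusion loss applied uniformly across a concatenated sequence of visual chain-of-thought, language, and action tokens. Everything here (the token layout, vocabulary size, `TinyBackbone`, `unified_diffusion_loss`) is an illustrative assumption for exposition, not the authors' implementation.

```python
# Minimal sketch of a unified masked-diffusion objective over a joint
# multimodal token sequence (visual CoT + language + action tokens).
# Illustrative only: all names and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, MASK_ID, DIM = 1024, 1023, 256  # assumed shared token vocabulary

class TinyBackbone(nn.Module):
    """Stand-in transformer; a real DLM backbone would be far larger."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):
        return self.head(self.encoder(self.embed(tokens)))

def unified_diffusion_loss(model, tokens):
    """Mask a random fraction t of ALL modality tokens and reconstruct
    them jointly, so vision CoT, language, and actions share one loss."""
    b, n = tokens.shape
    t = torch.rand(b, 1)                       # per-sample noise level
    mask = torch.rand(b, n) < t                # positions to corrupt
    noisy = tokens.masked_fill(mask, MASK_ID)
    logits = model(noisy)
    ce = F.cross_entropy(logits.reshape(-1, VOCAB), tokens.reshape(-1),
                         reduction="none").reshape(b, n)
    # Masked-diffusion weighting: only corrupted positions contribute,
    # scaled by 1/t as in standard discrete-diffusion objectives.
    return ((ce * mask) / t.clamp_min(1e-3)).sum() / mask.sum().clamp_min(1)

# Usage: one training step on a fake batch of joint tokens,
# conceptually laid out as [visual CoT | language | action].
model = TinyBackbone()
batch = torch.randint(0, VOCAB - 1, (4, 48))
loss = unified_diffusion_loss(model, batch)
loss.backward()
```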
Supplementary Material: pdf
Primary Area: applications to robotics, autonomy, planning
Submission Number: 20027