Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model

19 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: vision-language-action models, world modeling, diffusion models, robot learning
TL;DR: We introduce dual-stream diffusion with independent noise schedules to jointly model actions and future states, improving VLA model performance.
Abstract: Recently, augmenting vision-language-action models (VLAs) with world models has shown promise in robotic policy learning. However, it remains challenging to jointly predict next-state observations and action sequences because of the inherent difference between the two modalities. To address this, we propose DUal-STream diffusion (DUST), a world-model augmented VLA framework that handles the modality conflict and enhances the performance of VLAs across diverse tasks. Specifically, we propose a multimodal diffusion transformer architecture that explicitly maintains separate modality streams while enabling cross-modal knowledge sharing. In addition, we propose training techniques such as independent noise perturbations for each modality and a decoupled flow matching loss, which enable the model to learn the joint distribution in a bidirectional manner while avoiding the need for a unified latent space. Furthermore, building on the decoupled training framework, we introduce a sampling method in which action and vision tokens are denoised asynchronously at different rates, yielding improvements through inference-time scaling. In experiments on simulated benchmarks such as RoboCasa and GR-1, DUST achieves up to 6% gains over standard VLA baselines and world-modeling methods, with our inference-time scaling approach providing an additional 2-5% gain in success rate. On real-world tasks with the Franka Research 3, DUST outperforms baselines in success rate by 10%, confirming its effectiveness beyond simulation. Lastly, we demonstrate the effectiveness of DUST in large-scale pretraining with action-free videos from BridgeV2, where DUST yields significant gains when transferred to the RoboCasa benchmark.
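The decoupled training objective described in the abstract can be illustrated with a minimal sketch. This is an assumption-laden toy, not the authors' implementation: it assumes a rectified-flow interpolant, a velocity-prediction target, and a hypothetical two-stream `model` interface; the key point it shows is that each modality gets its own independently sampled noise level and its own flow matching loss term.

```python
import numpy as np

rng = np.random.default_rng(0)

def interpolate(x1, x0, t):
    # Rectified-flow interpolant: x_t = t * data + (1 - t) * noise
    return t * x1 + (1 - t) * x0

def decoupled_fm_loss(model, actions, frames):
    # Independent noise perturbations per modality: each stream draws
    # its own timestep (the exact schedule in DUST is assumed here).
    t_a = rng.uniform()                          # action-stream timestep
    t_v = rng.uniform()                          # vision-stream timestep
    eps_a = rng.standard_normal(actions.shape)
    eps_v = rng.standard_normal(frames.shape)
    a_t = interpolate(actions, eps_a, t_a)
    f_t = interpolate(frames, eps_v, t_v)
    # Hypothetical interface: the model returns one velocity field
    # per stream, conditioned on both noisy inputs and both timesteps.
    v_a, v_v = model(a_t, f_t, t_a, t_v)
    # Decoupled flow matching loss: one term per modality, summed,
    # so neither stream needs a shared latent space or shared schedule.
    loss_a = np.mean((v_a - (actions - eps_a)) ** 2)
    loss_v = np.mean((v_v - (frames - eps_v)) ** 2)
    return loss_a + loss_v

# Toy stand-in "model" that predicts zero velocities everywhere.
def zero_model(a_t, f_t, t_a, t_v):
    return np.zeros_like(a_t), np.zeros_like(f_t)

actions = rng.standard_normal((8, 7))    # e.g. an 8-step, 7-DoF action chunk
frames = rng.standard_normal((16, 32))   # flattened future-frame tokens
loss = decoupled_fm_loss(zero_model, actions, frames)
```

Because the two loss terms are independent, the same factorization also permits the asynchronous sampling the abstract mentions, where the two streams are denoised at different rates at inference time.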
Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Submission Number: 19263