World Action Models are Zero-shot Policies

Published: 02 Mar 2026 · Last Modified: 05 Mar 2026 · ICLR 2026 Workshop World Models · CC BY 4.0
Keywords: Robot Learning: Imitation Learning, Robot Learning: Model Learning, Robot Learning: World Model
TL;DR: DreamZero is a World Action Model robot policy that achieves state-of-the-art task generalization, real-time control, and strong cross-robot transfer.
Abstract: State-of-the-art Vision-Language-Action (VLA) models excel at semantic generalization but struggle to generalize to unseen physical motions in novel environments. We introduce DreamZero, a World Action Model (WAM) built on a pretrained video diffusion backbone. Unlike VLAs, WAMs learn physical dynamics by predicting future world states and actions, using video as a dense representation of how the world evolves. By jointly modeling video and action, DreamZero effectively learns diverse skills from heterogeneous robot data without relying on repetitive demonstrations, resulting in over 2× improvement in generalization to new tasks and environments compared to state-of-the-art VLAs in real-robot experiments. Crucially, through model and system optimizations, we enable a 14B autoregressive video diffusion model to perform real-time closed-loop control at 7 Hz. Finally, we demonstrate cross-embodiment transfer in both directions: (1) video-only demonstrations from other robots or humans improve unseen task performance by over 40% with just 10–20 minutes of data, and (2) DreamZero adapts to entirely new embodiments, achieving zero-shot generalization on the YAM robot with only 30 minutes of play data.
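The abstract's core loop, jointly predicting future world states and action chunks from the current observation, then executing actions closed-loop, can be sketched as a toy example. This is a minimal illustration only: the class and function names, dimensions, and the linear stand-in for the 14B video diffusion backbone are all hypothetical, not the authors' implementation.

```python
import numpy as np

class ToyWorldActionModel:
    """Toy stand-in for a World Action Model (WAM): given the current
    observation, jointly predict the next world state ("frame") and a
    chunk of actions. A real WAM would be an autoregressive video
    diffusion model; a fixed random linear map keeps this self-contained."""

    def __init__(self, obs_dim=16, act_dim=7, chunk=4, seed=0):
        rng = np.random.default_rng(seed)
        self.W_obs = rng.standard_normal((obs_dim, obs_dim)) * 0.1
        self.W_act = rng.standard_normal((act_dim * chunk, obs_dim)) * 0.1
        self.act_dim, self.chunk = act_dim, chunk

    def predict(self, obs):
        next_obs = np.tanh(self.W_obs @ obs)  # predicted future world state
        actions = (self.W_act @ obs).reshape(self.chunk, self.act_dim)
        return next_obs, actions

def control_loop(model, obs, steps=3):
    """Closed-loop control: each tick, jointly predict frame + action
    chunk, execute the chunk, then re-observe (here: the predicted
    frame substitutes for a new camera observation)."""
    executed = []
    for _ in range(steps):
        obs, actions = model.predict(obs)
        executed.append(actions)
    return np.concatenate(executed)

model = ToyWorldActionModel()
trajectory = control_loop(model, np.ones(16))
print(trajectory.shape)  # (steps * chunk, act_dim) = (12, 7)
```

In the real system this loop would run at 7 Hz, with the diffusion model's predicted video serving as the dense dynamics representation rather than a vector.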
Supplementary Material: zip
Submission Number: 110