Keywords: world model, sequential decision-making, embodied control
Abstract: World models play a crucial role in decision-making within embodied environments, enabling cost-free exploration that would otherwise be expensive in the real world. To facilitate effective decision-making, world models must generalize well enough to support faithful imagination in out-of-distribution (OOD) regions, a requirement that poses significant challenges for previous approaches. This paper introduces WHALE, a framework for learning generalizable world models with the behavior-conditioning technique, aiming to address policy distribution shift, one of the primary sources of world model generalization error. Building upon this, we instantiate WHALE as a scalable vision-based world model built on a spatial-temporal transformer architecture, designed to support high-fidelity imagination over long horizons. We further introduce WHALE-X, a 414M-parameter world model pre-trained on 970K Open X-Embodiment trajectories, which exhibits promising scalability and generalizability in real-world manipulation tasks using minimal demonstrations.
Submission Number: 48