WHALE: Towards Generalizable and Scalable World Models for Embodied Decision-making

Zhilong Zhang; Ruifeng Chen; Junyin Ye; Yihao Sun; Haoxiang Ren; Xinghao Du; Pengyuan Wang; Jing-Cheng Pang; Kaiyuan Li; Tian-Shuo Liu; Haoxin Lin; Yang Yu; Zhi-Hua Zhou

Published: 19 Sept 2025, Last Modified: 28 Sept 2025. Venue: NeurIPS 2025 Workshop EWM. License: CC BY 4.0

Keywords: world model, sequential decision-making, embodied control

Abstract: World models play a crucial role in decision-making within embodied environments, enabling cost-free exploration that would otherwise be expensive in the real world. To facilitate effective decision-making, world models must generalize well enough to support faithful imagination in out-of-distribution (OOD) regions, which has posed a significant challenge for previous approaches. This paper introduces WHALE, a framework for learning generalizable world models with a behavior-conditioning technique that aims to address policy distribution shift, one of the primary sources of world model generalization error. Building upon this, we instantiate WHALE as a scalable vision-based world model built on a spatial-temporal transformer architecture, designed to support high-fidelity imagination over long horizons. We further introduce WHALE-X, a 414M-parameter world model pre-trained on 970K Open X-Embodiment trajectories, which exhibits promising scalability and generalizability in real-world manipulation tasks using minimal demonstrations.

Submission Number: 48
