Keywords: world model, Embodied AI
Abstract: Generative models are pivotal for creating world models in robotics, yet they struggle to produce physically plausible dynamics, especially in complex, contact-rich manipulation tasks. Conventional embodied world models for manipulation simulation are often limited by explicit physical constraints or insufficient scale, leading to poor generalization across robot embodiments, materials, actions, or environments. We introduce WoW, a 14B-parameter embodied world model, to demonstrate that scaling, when guided by key architectural innovations, can unlock a new level of physical plausibility in complex manipulation simulation. Our approach is twofold: (1) As a foundation, we ensure visual-level realism with a novel token distillation loss that grounds the model in the robust feature space of a pre-trained vision model (DINO). (2) Furthermore, we propose a conceptual framework, a self-optimizing world model, implemented as a dynamic instruction refinement system that allows the model to continuously improve its physical predictions during inference, thereby enhancing both physical realism and temporal consistency. WoW demonstrates a strong grasp of physical causality and collision dynamics across a challenging set of 600+ manipulation videos spanning 4 core abilities and 20 sub-dimension tasks, under both human evaluation and automatic metrics, as well as a 5-task real-world Franka evaluation. Our extensive scaling experiments reveal that performance on the most challenging, contact-rich tasks shows accelerated gains with larger training datasets. WoW sets a new state-of-the-art in generalizable manipulation simulation, producing physically plausible outcomes for tasks far exceeding the capabilities of previous generative models. We include our video demos and code at wow-world-model-iclr.github.io
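The token distillation loss described in point (1) can be sketched minimally: the generator's per-patch token features for a predicted frame are pulled toward the frozen DINO encoder's features for the ground-truth frame. The function below is a hedged illustration in NumPy, not the paper's implementation; the token shapes, normalization, and use of mean squared error are assumptions for the sake of the example.

```python
import numpy as np

def token_distillation_loss(pred_tokens: np.ndarray, teacher_tokens: np.ndarray) -> float:
    """Distillation loss between predicted and teacher token features.

    pred_tokens, teacher_tokens: shape (num_tokens, dim), e.g. patch
    embeddings of a generated frame vs. a frozen DINO encoder's
    embeddings of the corresponding ground-truth frame (hypothetical
    shapes; the actual model may use a different token layout).
    Both sets are L2-normalized per token, then compared with MSE,
    so the loss is invariant to feature magnitude.
    """
    p = pred_tokens / (np.linalg.norm(pred_tokens, axis=-1, keepdims=True) + 1e-8)
    t = teacher_tokens / (np.linalg.norm(teacher_tokens, axis=-1, keepdims=True) + 1e-8)
    return float(np.mean((p - t) ** 2))
```

With identical inputs the loss is zero, and it grows as the generated frame's features drift from the teacher's, which is the grounding effect the abstract attributes to the DINO feature space.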
Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Submission Number: 3613