Keywords: world model, Embodied AI
Abstract: Generative models are pivotal for creating world models in robotics, yet they struggle to produce physically plausible dynamics, especially in complex, contact-rich manipulation tasks. Conventional embodied world models for manipulation simulation are often limited by explicit physical constraints or insufficient scale, leading to poor generalization across robot embodiments, materials, actions, or environments. We introduce WoW, a 14B-parameter embodied world model, to demonstrate that scaling, when guided by key architectural innovations, can unlock a new level of physical plausibility in complex manipulation simulation. Our approach is twofold: (1) As a foundation, we ensure visual-level realism with a novel token distillation loss that grounds the model in the robust feature space of a pre-trained vision model (DINO). (2) Furthermore, we propose a conceptual framework, a self-optimizing world model, implemented as a dynamic instruction refinement system that allows the model to continuously improve its physical predictions during inference, thereby enhancing both physical realism and temporal consistency. WoW demonstrates a strong grasp of physical causality and collision dynamics across a challenging set of 600+ manipulation videos spanning 4 core abilities and 20 sub-dimension tasks, under both human evaluation and automatic metrics, as well as a 5-task real-world Franka evaluation. Our extensive scaling experiments reveal that performance on the most challenging, contact-rich tasks shows accelerated gains with larger training datasets. WoW sets a new state-of-the-art in generalizable manipulation simulation, producing physically plausible outcomes for tasks far exceeding the capabilities of previous generative models. We include our video demos and code at wow-world-model-iclr.github.io
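The token distillation loss described in point (1) can be sketched minimally: the generator's per-patch token features for a predicted frame are pulled toward the frozen DINO encoder's features for the ground-truth frame. The function below is a hedged illustration in NumPy, not the paper's implementation; the token shapes, normalization, and use of mean squared error are assumptions for the sake of the example.

```python
import numpy as np

def token_distillation_loss(pred_tokens: np.ndarray, teacher_tokens: np.ndarray) -> float:
    """Distillation loss between predicted and teacher token features.

    pred_tokens, teacher_tokens: shape (num_tokens, dim), e.g. patch
    embeddings of a generated frame vs. a frozen DINO encoder's
    embeddings of the corresponding ground-truth frame (hypothetical
    shapes; the actual model may use a different token layout).
    Both sets are L2-normalized per token, then compared with MSE,
    so the loss is invariant to feature magnitude.
    """
    p = pred_tokens / (np.linalg.norm(pred_tokens, axis=-1, keepdims=True) + 1e-8)
    t = teacher_tokens / (np.linalg.norm(teacher_tokens, axis=-1, keepdims=True) + 1e-8)
    return float(np.mean((p - t) ** 2))
```

With identical inputs the loss is zero, and it grows as the generated frame's features drift from the teacher's, which is the grounding effect the abstract attributes to the DINO feature space.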
Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Submission Number: 3613