Opinion: A Unified World Model is the cornerstone for integrating perception, reasoning, and decision-making in embodied AI

Published: 19 Sept 2025, Last Modified: 19 Sept 2025 · NeurIPS 2025 Workshop EWM · CC BY 4.0
Keywords: world model, embodied AI, visuo-conceptual latent, end-to-end decision-making, sim-to-real
TL;DR: A reconstructive, visuo‑conceptual unified world model—coupled with MLLMs and differentiable imagination—unifies perception, reasoning, and decision-making for embodied agents, with practical guidelines, ablations, and an evaluation protocol.
Abstract: We argue that a unified world model is a foundational mechanism for integrating perception, reasoning, and decision-making in embodied agents. Concretely, we define a visuo-conceptual, reconstructive latent state learned jointly with dynamics and policy that connects pixel-grounded 2D/3D scene understanding to language and action. By enabling internal simulation with decodable futures, such a model supports long-horizon planning, cross-modal knowledge transfer from multimodal LLMs, and end-to-end optimization in closed-loop settings. We synthesize converging evidence from world-model reinforcement learning, vision–language–action systems, diffusion-based control, and applications in robotics, autonomous driving, and open-ended environments. We outline a concrete research agenda: (i) a bidirectional scene memory that decodes to images, video, and affordance fields; (ii) differentiable imagination for evaluating and selecting actions; (iii) grounding language priors in latent 3D and temporal structure; and (iv) rigorous sim-to-real evaluation with uncertainty. We distill design patterns, failure modes, and actionable benchmarks to accelerate progress.
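The "differentiable imagination" ingredient of the agenda (ii) can be illustrated with a toy sketch: roll candidate action sequences forward through a latent dynamics model and select the sequence with the highest imagined return. Everything below is a hypothetical stand-in, not the authors' implementation: the linear dynamics `z' = A z + B a` and the scalar reward readout `w · z` are placeholders for the learned world model and decoder described in the abstract.

```python
import numpy as np

# Toy sketch of imagination-based action selection. The linear latent
# dynamics and reward readout are illustrative placeholders for a
# learned world model; dimensions are arbitrary.
rng = np.random.default_rng(0)
LATENT_DIM, ACTION_DIM, HORIZON, N_CANDIDATES = 8, 2, 5, 64

A = 0.9 * np.eye(LATENT_DIM)                  # latent transition (placeholder)
B = rng.normal(size=(LATENT_DIM, ACTION_DIM)) # action effect (placeholder)
w = rng.normal(size=LATENT_DIM)               # reward readout (placeholder)

def imagine_return(z0, actions):
    """Accumulate predicted reward along an imagined latent rollout."""
    z, total = z0.copy(), 0.0
    for a in actions:
        z = A @ z + B @ a   # one latent dynamics step
        total += w @ z      # decode a scalar reward from the latent
    return total

# Sample candidate action sequences, score each in imagination, pick the best.
z0 = rng.normal(size=LATENT_DIM)
candidates = rng.normal(size=(N_CANDIDATES, HORIZON, ACTION_DIM))
returns = np.array([imagine_return(z0, seq) for seq in candidates])
best_plan = candidates[np.argmax(returns)]
print(best_plan.shape)  # (5, 2)
```

In a full system the random-shooting search would be replaced by gradient-based optimization through the (differentiable) dynamics, and the reward readout by decodable futures scored against task objectives; the control flow, however, stays the same.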
Submission Number: 43