Keywords: Embodied AI, World Model
TL;DR: We present a 4D world model that predicts dynamic 3D meshes from image and language inputs.
Abstract: In this paper, we present a 4D embodied world model, which takes an image observation and a language instruction as input and predicts a 4D dynamic mesh describing how the scene will change as the embodied agent performs actions according to the given instruction. In contrast to previously learned world models, which typically generate 2D videos, our 4D model provides detailed 3D information about the precise configuration and shape of objects in a scene over time.
This allows us to effectively learn accurate inverse dynamics models, enabling an embodied agent to execute a policy for interacting with the environment.
To construct a dataset for training such 4D world models, we first annotate an existing large-scale robotics video dataset using pretrained depth and normal prediction models, producing a 3D-consistent 4D model of each video. To efficiently learn generative models on this 4D data, we train a video generative model on the annotated dataset that jointly predicts RGB-DN (RGB, Depth, and Normal) for each video. We then present an algorithm that directly converts the generated RGB, depth, and normal images into high-quality dynamic 4D mesh models of the world. We illustrate how this enables us to predict high-quality meshes consistent across both time and space in embodied scenarios, render novel views of embodied scenes, and construct policies that substantially outperform those from prior 2D and 3D world models. Our code, model, and dataset will be made publicly available.
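One standard way to lift per-frame depth predictions into geometry, as the conversion step described above must do, is to backproject each pixel through a pinhole camera model and triangulate neighboring pixels into a mesh. The sketch below illustrates this generic depth-to-mesh construction; the function name, intrinsics parameters, and triangulation scheme are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def depth_to_mesh(depth, fx, fy, cx, cy):
    """Backproject a depth map into 3D vertices and triangulate
    neighboring pixels into mesh faces.

    A generic sketch of one way to convert a predicted depth frame
    into a mesh; not the paper's specific conversion algorithm.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))

    # Pinhole backprojection: pixel (u, v) with depth z -> camera-frame (x, y, z).
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    vertices = np.stack([x, y, z], axis=-1).reshape(-1, 3)

    # Split every 2x2 block of adjacent pixels into two triangles.
    idx = np.arange(h * w).reshape(h, w)
    tl, tr = idx[:-1, :-1], idx[:-1, 1:]
    bl, br = idx[1:, :-1], idx[1:, 1:]
    faces = np.concatenate([
        np.stack([tl, bl, tr], axis=-1).reshape(-1, 3),
        np.stack([tr, bl, br], axis=-1).reshape(-1, 3),
    ])
    return vertices, faces
```

Applying this per frame yields a sequence of meshes, i.e., a dynamic 4D representation; predicted normals can additionally be used to filter or refine faces at depth discontinuities.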
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 796