Keywords: World Modeling, Dynamics Modeling, Robotic Manipulation
Abstract: Humans can anticipate, from a glance and a contemplated bodily action, how the
3D world will respond. This predictive ability is equally vital for enabling robots
to manipulate and interact with the physical world. We introduce PointWorld,
a foundation 3D world model that unifies state and action in a shared spatial
domain and predicts 3D point flow over short horizons: given one or a few RGB-D
images and a sequence of robot actions, PointWorld forecasts per-point scene
displacements that respond to the actions. To train our 3D world model, we curate
a large-scale dataset for 3D dynamics learning spanning real and simulated robotic
manipulation in diverse open-world environments, enabled by recent advances
in 3D vision and scalable simulation, and totaling about 2M trajectories
and 500 hours. Through rigorous, large-scale empirical studies of backbones,
action representations, learning objectives, data mixtures, domain transfers, and
scaling, we distill design principles for large-scale 3D world modeling. PointWorld
enables zero-shot simulation from in-the-wild RGB-D captures. It also powers
model-based planning and control on real hardware that generalizes across diverse
objects and environments, all without task-specific demonstrations or training.
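To make the stated interface concrete, the following is a minimal, hypothetical sketch of the input/output contract the abstract describes (an RGB-D observation plus a robot action sequence in, per-point 3D displacements out over the horizon). Every name below (`back_project`, `DummyPointWorld`, `predict_flow`) is illustrative and assumed for this sketch; it is not the authors' actual API.

```python
# Hypothetical sketch of PointWorld's input/output contract, not the real model.
import numpy as np

def back_project(depth, fx=500.0, fy=500.0, cx=None, cy=None):
    """Lift a depth map (H, W) to an (N, 3) point cloud via pinhole intrinsics."""
    h, w = depth.shape
    cx = w / 2.0 if cx is None else cx
    cy = h / 2.0 if cy is None else cy
    v, u = np.mgrid[0:h, 0:w]                     # pixel row/column grids
    z = depth.reshape(-1)
    x = (u.reshape(-1) - cx) * z / fx
    y = (v.reshape(-1) - cy) * z / fy
    return np.stack([x, y, z], axis=-1)           # (H*W, 3) scene points

class DummyPointWorld:
    """Stand-in for the learned model: maps (points, actions) -> per-point flow."""
    def predict_flow(self, points, actions):
        # points: (N, 3) back-projected scene points
        # actions: (T, A) robot action sequence over a short horizon
        # returns: (T, N, 3) predicted 3D displacement of each point per step
        T, N = actions.shape[0], points.shape[0]
        return np.zeros((T, N, 3), dtype=np.float32)  # learned in the real model

# Usage: one RGB-D frame plus a short action horizon.
depth = np.ones((48, 64), dtype=np.float32)       # fake planar depth at 1 m
points = back_project(depth)                      # (3072, 3)
actions = np.zeros((8, 7), dtype=np.float32)      # e.g. 8 end-effector commands
flow = DummyPointWorld().predict_flow(points, actions)
print(flow.shape)                                 # (8, 3072, 3)
```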
Primary Area: applications to robotics, autonomy, planning
Submission Number: 16437