Zero-shot Object Understanding with a Physically Controllable World Model

15 Sept 2025 (modified: 12 Feb 2026) · ICLR 2026 Conference Desk Rejected Submission · CC BY 4.0
Keywords: segmentation; world modeling
TL;DR: a physically controllable world model for object understanding
Abstract: Humans acquire intuitive notions of objects directly from visual experience: they perceive objects as bounded physical units that move together, and they learn properties such as 3D structure and material. Existing AI models fail to develop these capabilities in a self-supervised way. Even when provided with expensive large-scale supervision, such models fail to generalize beyond the specific domains where annotation is feasible and fall short of a holistic physical understanding of objects and their properties. To address these gaps, we introduce PhyWM, a Physically controllable World Model that takes the state of the world represented by RGB patches (appearance) and provides a natural interface for physical control through flow patches (dynamics)---allowing causal queries such as how the rest of the scene would evolve if a region were set into motion. PhyWM can be trained in a self-supervised manner on video datasets, and simple zero-shot inference strategies applied to it unlock diverse forms of object understanding. PhyWM discovers objects by virtually poking different parts of an image and observing which pixels move together. Having discovered object boundaries, PhyWM can manipulate objects in 3D by specifying multiple virtual pokes on them. Finally, we show that PhyWM supports various forms of physical reasoning, such as identifying material properties and understanding inter-object physical relationships. PhyWM outperforms both task-specific models and other generative world models on physical object discovery and 3D object understanding. With its self-supervised pretraining objective and rich physically controllable interface, PhyWM emerges as a universal object-understanding model.
Supplementary Material: pdf
Primary Area: foundation or frontier models, including LLMs
Submission Number: 6385