TL;DR: We introduce a simple yet effective method for constructing versatile world models on top of pre-trained DINOv2 features. The resulting models generalize to complex environment dynamics and enable zero-shot solutions for arbitrary goals at test time.
Abstract: The ability to predict future outcomes given control actions is fundamental for physical reasoning. However, such predictive models, often called world models, remain challenging to learn and are typically developed for task-specific solutions with online policy learning. To unlock world models' true potential, we argue that they should 1) be trainable on offline, pre-collected trajectories, 2) support test-time behavior optimization, and 3) facilitate task-agnostic reasoning. To this end, we present DINO World Model (DINO-WM), a new method to model visual dynamics without reconstructing the visual world. DINO-WM leverages spatial patch features pre-trained with DINOv2, enabling it to learn from offline behavioral trajectories by predicting future patch features. This allows DINO-WM to achieve observational goals through action sequence optimization, facilitating task-agnostic planning by treating goal features as prediction targets. We demonstrate that DINO-WM achieves zero-shot behavioral solutions at test time on six environments without expert demonstrations, reward modeling, or pre-learned inverse models, outperforming prior state-of-the-art work across diverse task families such as arbitrarily configured mazes, push manipulation with varied object shapes, and multi-particle scenarios.
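To make the training recipe concrete, below is a minimal sketch, assuming a PyTorch setup, of learning such a latent dynamics model from offline trajectories. Loading DINOv2 through its public torch.hub entry point is real; the `LatentDynamics` module, its shapes, the two-dimensional action space, and the training loop are illustrative assumptions, not the released DINO-WM code.

```python
# A minimal sketch, assuming a PyTorch setup. Loading DINOv2 via torch.hub
# is real; LatentDynamics, its dimensions, and the training loop are
# illustrative assumptions, not the released DINO-WM code.
import torch
import torch.nn as nn

# Frozen pre-trained encoder: maps images to a grid of patch features.
encoder = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
encoder.eval().requires_grad_(False)

class LatentDynamics(nn.Module):
    """Predicts next-step patch features from current features and an action."""
    def __init__(self, feat_dim=384, action_dim=2, n_layers=4, n_heads=6):
        super().__init__()
        self.action_proj = nn.Linear(action_dim, feat_dim)
        layer = nn.TransformerEncoderLayer(feat_dim, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(feat_dim, feat_dim)

    def forward(self, patches, action):
        # patches: (B, N, D) patch tokens; action: (B, A) continuous action.
        tokens = torch.cat([self.action_proj(action)[:, None], patches], dim=1)
        return self.head(self.backbone(tokens)[:, 1:])  # features at t+1

dynamics = LatentDynamics()
opt = torch.optim.AdamW(dynamics.parameters(), lr=3e-4)

def train_step(obs_t, obs_next, action_t):
    """One gradient step on an offline (o_t, a_t, o_{t+1}) batch.

    obs_t, obs_next: (B, 3, 224, 224) normalized images (14-divisible sides).
    """
    with torch.no_grad():  # targets live in frozen DINOv2 feature space
        z_t = encoder.forward_features(obs_t)["x_norm_patchtokens"]
        z_next = encoder.forward_features(obs_next)["x_norm_patchtokens"]
    loss = nn.functional.mse_loss(dynamics(z_t, action_t), z_next)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Because the loss is computed on patch features rather than pixels, no decoder is needed in the training loop; this is the sense in which the model "predicts future patch features" above.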
Lay Summary: A core capability of intelligent agents is predicting how their actions will affect the environment. Giving machines this foresight is the goal of world models, which predict future outcomes based on current actions. However, most existing world models are hard to train, rely on hand-crafted rewards, and are tailored to one specific task at a time.
We introduce DINO-WM, a new world model that is task-agnostic, can be trained entirely on offline datasets, and enables agents to reason at test time by optimizing over action sequences. DINO-WM leverages the pre-trained vision encoder DINOv2 for strong spatial understanding. This allows the model to predict directly in a compact latent space, capturing task-relevant information while avoiding the need to reconstruct raw pixels, which reduces both complexity and computational cost.
With this approach, DINO-WM enables zero-shot planning for unseen goals and environment configurations, such as navigating unfamiliar mazes or manipulating new object shapes. It brings us closer to building general-purpose world models that enable flexible, goal-directed behavior without additional supervision or task-specific retraining.
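As a companion sketch under the same assumptions, here is one way the test-time behavior optimization could look: a simple random-shooting planner that rolls the learned dynamics forward in DINOv2 feature space and keeps the action sequence whose predicted features best match the goal image's features. `encoder` and `dynamics` are the hypothetical modules from the training sketch; the horizon, sample count, action bounds, and cost are placeholders, and a CEM-style optimizer could be substituted for the uniform sampling.

```python
# A hedged sketch of zero-shot planning by action-sequence optimization.
# All hyperparameters here are illustrative, not the paper's settings.
import torch

@torch.no_grad()
def plan_actions(obs, goal, horizon=10, n_samples=256, action_dim=2):
    """Return the best (horizon, action_dim) action sequence toward `goal`."""
    z = encoder.forward_features(obs)["x_norm_patchtokens"]        # (1, N, D)
    z_goal = encoder.forward_features(goal)["x_norm_patchtokens"]  # (1, N, D)

    # Sample candidate action sequences uniformly in [-1, 1].
    actions = torch.rand(n_samples, horizon, action_dim) * 2 - 1

    z = z.expand(n_samples, -1, -1)  # roll all candidates forward in latent space
    for t in range(horizon):
        z = dynamics(z, actions[:, t])

    # Cost: distance between predicted and goal patch features; no reward model.
    cost = (z - z_goal).pow(2).mean(dim=(1, 2))  # (n_samples,)
    return actions[cost.argmin()]
```

Since the goal is specified purely as an image whose features serve as the prediction target, the same planner applies to any task the dynamics model covers, which is the task-agnostic planning claim above.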
Link To Code: https://dino-wm.github.io/
Primary Area: Applications->Robotics
Keywords: World Models, Planning, Representation Learning
Submission Number: 7285