Can World Models Benefit VLMs for World Dynamics?

10 Sept 2025 (modified: 08 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: World Model, Multi-modal Large Language Model, Multi-modal Representation Learning
Abstract: Trained on internet-scale video data, world models are increasingly recognized as powerful world simulators that can generate consistent and plausible dynamics over structure, motion, and physics. While recent studies have explored the few-shot capabilities of world models on vision tasks, they typically lack a systematic investigation of how such methods extend to generic tasks. We study what happens when these priors are transferred into a Vision-Language Model (VLM): we re-purpose a video diffusion model as a $\textbf{generative encoder}$, query it for a single denoising step, and treat the resulting latents as an additional set of visual embeddings. We empirically investigate this class of models, which we refer to as World-Language Models (WorldLMs), and find that generative encoders can indeed capture latents useful for downstream understanding, with clear distinctions from conventional vision encoders. Naming our best-performing WorldLM the $\textbf{Dy}$namic $\textbf{V}$ision $\textbf{A}$ligner ($\textbf{DyVA}$), we further find that this method significantly enhances spatial reasoning and enables single-image models to perform multi-frame reasoning. By curating a suite of spatial evaluation sets, we find that DyVA surpasses both open-source and proprietary baselines on out-of-domain tasks, achieving \textbf{state-of-the-art performance on MindCube}. Finally, we systematically explore a range of model designs to highlight promising directions for future work. We hope our study paves the way for a new family of VLMs that leverage priors from world models.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 3611
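Below is a minimal sketch, under stated assumptions, of the pipeline the abstract describes: a frozen video diffusion backbone is queried for one denoising step, its latents are flattened into tokens, and those tokens are concatenated with standard vision-encoder tokens to form the LLM's visual prefix. The class names (`GenerativeEncoder`, `WorldLMProjector`), the backbone interface, the chosen timestep, and the projector layout are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of the WorldLM/DyVA-style recipe from the abstract:
# one denoising step of a frozen video diffusion backbone, reused as
# extra visual embeddings for a VLM. Interfaces and shapes are assumed.
import torch
import torch.nn as nn


class GenerativeEncoder(nn.Module):
    """Wraps a frozen video diffusion backbone as a one-step feature extractor."""

    def __init__(self, diffusion_backbone: nn.Module, timestep: int = 500):
        super().__init__()
        self.backbone = diffusion_backbone   # assumed: maps (latents, t) -> features
        self.timestep = timestep             # single noise level to query (assumed)
        for p in self.backbone.parameters():
            p.requires_grad_(False)          # keep the world-model prior frozen

    @torch.no_grad()
    def forward(self, video_latents: torch.Tensor) -> torch.Tensor:
        # video_latents: (B, T, C, H, W) VAE latents of the input frame(s).
        t = torch.full(
            (video_latents.shape[0],), self.timestep,
            device=video_latents.device, dtype=torch.long,
        )
        noisy = video_latents + torch.randn_like(video_latents)  # noise schedule omitted
        feats = self.backbone(noisy, t)      # one denoising step; keep hidden features
        b, f, c, h, w = feats.shape
        # Flatten spatio-temporal positions into a token sequence: (B, T*H*W, C).
        return feats.permute(0, 1, 3, 4, 2).reshape(b, f * h * w, c)


class WorldLMProjector(nn.Module):
    """Projects generative-encoder tokens into the LLM space and concatenates
    them with conventional vision-encoder tokens into one visual prefix."""

    def __init__(self, gen_dim: int, vit_dim: int, llm_dim: int):
        super().__init__()
        self.gen_proj = nn.Sequential(
            nn.Linear(gen_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )
        self.vit_proj = nn.Linear(vit_dim, llm_dim)

    def forward(self, gen_tokens: torch.Tensor, vit_tokens: torch.Tensor) -> torch.Tensor:
        # gen_tokens: (B, N_gen, gen_dim); vit_tokens: (B, N_vit, vit_dim).
        return torch.cat([self.vit_proj(vit_tokens), self.gen_proj(gen_tokens)], dim=1)
```

One natural training recipe in this setup is to keep both encoders frozen and train only the projectors (and optionally the LLM); the abstract does not specify DyVA's exact training regime.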